Abstract

Let X be a non-negative random variable and let the conditional distribution of a random variable Y, given
X, be Poisson($\gamma \cdot X$), for a parameter $\gamma > 0$. We identify a natural loss function such that:

• The derivative of the mutual information between X and Y with respect to $\gamma$ is equal to the minimum mean
loss in estimating X based on Y, regardless of the distribution of X.

• When X ~ P is estimated based on Y by a mismatched estimator that would have minimized the expected
loss had X ~ Q, the integral over all values of $\gamma$ of the excess mean loss is equal to the relative entropy
between P and Q.

For a continuous time setting where $X^T = \{X_t, 0 \le t \le T\}$ is a non-negative stochastic process and the conditional
law of $Y^T = \{Y_t, 0 \le t \le T\}$, given $X^T$, is that of a non-homogeneous Poisson process with intensity function
$\gamma \cdot X^T$, under the same loss function:

• The minimum mean loss in causal filtering when $\gamma = \gamma_0$ is equal to the expected value of the minimum mean
loss in non-causal filtering (smoothing) achieved with a channel whose parameter $\gamma$ is uniformly distributed
between 0 and $\gamma_0$. Bridging the two quantities is the mutual information between $X^T$ and $Y^T$.

• This relationship between the mean losses in causal and non-causal filtering holds also in the case where the
filters employed are mismatched, i.e., optimized assuming a law on $X^T$ which is not the true one. Bridging
the two quantities in this case is the sum of the mutual information and the relative entropy between the
true and the mismatched distribution of $Y^T$. Thus, relative entropy quantifies the excess estimation loss
due to mismatch in this setting.

These results parallel those recently found for the Gaussian channel: the I-MMSE relationship of Guo, Shamai
and Verdú, the relative entropy and mismatched estimation relationship of Verdú, and the relationship between
causal and non-causal mismatched estimation of Weissman.

Index Terms - Causal estimation, Divergence, Girsanov transformation, I-MMSE, Mismatched estimation, Mutual
information, Nonlinear filtering, Point process, Poisson channel, Relative entropy, Shannon theory, Statistics



1 Introduction 

In the seminal paper [13], Guo, Shamai and Verdú discovered that the derivative of the mutual information between
the input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),
is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simple
relationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as the
continuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34], [21] for even more general settings where
this relationship holds). When combined with Duncan's theorem [7], it was also shown to imply a remarkable
relationship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributed
continuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level $\gamma$ is equal to the mean value
of the smoothing MMSE with SNR uniformly distributed between 0 and $\gamma$. The relation of the mutual information
to both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdú has shown in [31] that when X ~ P is estimated based on Y by a mismatched estimator
that would have minimized the MSE had X ~ Q, the integral over all SNR values up to $\gamma$ of the excess MSE due to
the mismatch is equal to the relative entropy between the true channel output distribution and the channel output
distribution under Q, at SNR = $\gamma$.

This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEs
continues to hold also in the mismatched case, i.e., when the filters are optimized for an underlying signal distribution
that differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown to
be the sum of the mutual information and the relative entropy between the true and mismatched output distributions,
this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, the
input, is a non-negative random variable while the conditional distribution of the output Y given the input is
given by Poisson($\gamma \cdot X$), the parameter $\gamma > 0$ here playing the role of SNR. In the continuous time setting, the
channel input is $X^T = \{X_t, 0 \le t \le T\}$, a non-negative stochastic process, and conditionally on $X^T$, the output
$Y^T = \{Y_t, 0 \le t \le T\}$ is a non-homogeneous Poisson process with intensity function $\gamma \cdot X^T$. Often referred to as the
"ideal Poisson channel" [15], this model is the canonical one for describing direct detection optical communication:
the channel input represents the squared magnitude of the electric field incident on the photo-detector, while its
output is the counting process describing the arrival times of the photons registered by the detector. Here the energy
of the channel input signal is proportional to its $L_1$ norm, rather than the $L_2$ norm as in the Gaussian channel. Thus it
is the amplification factor $\gamma$ rather than $\gamma^2$ that plays the role of SNR. We refer to [32] for a review of the literature on
the Poisson channel and its communication theoretic significance, and to and references therein for applications 
of Poisson channel models in other fields. 

The function $\ell_0(x) = x \log x - x + 1$, $x > 0$ (where log denotes the natural logarithm throughout), being the
convex conjugate of the Poisson distribution's log moment generating function, arises naturally in analysis of Poisson
and continuous time jump Markov processes in a variety of situations. These include relative entropy representation
for jump Markov processes (see, e.g., equation (3.20) and Theorem 3.3 of [5]), large deviation local rate function for 
such processes (jSJ, Chapter 5 of 29J), mutual information in the Poisson channel (Section 19.5 and equation (19.135) 
of [20]), and logarithmic transformations in stochastic control theory (Section 3 of 0). It is also intimately related 
to change-of-measure formulae for point processes in the spirit of the Girsanov transformation (Section VI. (5.5-6) of 
0> [EL |2B]). It is therefore not surprising that the function $\ell$ appears in this paper in representations for relative
entropy and related calculations. It is less obvious, however, that using it to define estimation loss turns out to be
very useful and, in particular, gives rise to a number of results that parallel the Gaussian theory.

Enter the loss function $\ell : [0, \infty) \times [0, \infty) \to [0, \infty]$ defined by $\hat{x}\,\ell_0(x/\hat{x})$ or, more precisely,

$$\ell(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}, \tag{1}$$

where the right hand side of (1) is well-defined as an extended non-negative real number in view of our conventions
$0 \log 0 = 0$, $\log 0/0 = 0$, $c/0 = \infty$ and $\log c/0 = \infty$ for $c > 0$. In Section 2 we exhibit properties of this loss function
that show it is a natural one for measuring goodness of reconstruction of non-negative objects, and that it shares
some of its key properties with the squared error loss, such as optimality of the conditional expectation under the
mean loss criterion.

The goal of this paper is to show that a set of relations identical to those that hold for the Gaussian channel
- ranging from Duncan's formula [7], to the I-MMSE of [13], [33], to Verdú's relationship between relative entropy
and mismatched estimation [31], to the relationship between causal and non-causal estimation in continuous time
for matched [13] and mismatched [33] filters - hold for the Poisson channel upon replacing the squared error loss by
the loss function in (1).

It is instructive to note that while the relative entropy between two Gaussians of the same variance and means
$m_1$ and $m_2$ is equal to $(m_1 - m_2)^2$, that between two exponentials of parameters $\lambda_1$ and $\lambda_2$ is equal to $\ell(\lambda_1, \lambda_2)$
(with additional multiplicative terms in both cases). Although this simple fact does not exclusively explain the
Gaussian-Poissonian analogy, it lies at its heart, along with further properties of $\ell$ observed in Section 2.
Our emphasis is on the results for the mismatched setting, relating the cost of mismatch to relative entropy in
the Poisson channel. The results for the exact (i.e., non-mismatched) setting, relating the minimum mean loss to
mutual information, and causal to non-causal minimum mean estimation loss, are shown to follow as special cases.
The latter results, for the exact setting, are consistent and in fact coincide with those of [14] - which considered a
more general Poisson channel model that accommodates the presence of dark current - when specialized to the case
of zero dark current. Our framework complements the results of [14] not only in extending the scope to the presence
of mismatch, but also in highlighting that the estimation theoretic quantities obtained in [14] have minimum mean
loss interpretations paralleling those from the Gaussian setting.

The remainder of the paper is organized as follows. Section 2 establishes some basic properties showing the loss
function in (1) is a natural one, as discussed above. After introducing our standard notation and conventions for
information measures in Section 3, we present the main results of this paper in Section 4, relating relative entropy and
mismatched estimation for the Poisson channel, under the loss function in (1), for random variables and processes
(causal and non-causal estimation in the latter case). In Section 5 we detail implications of our results: we show
that they not only allow us to recover known results from the non-mismatched Poisson channel setting such as some of
those in [20] and [14], but also endow the latter with optimal estimation interpretations that make it easy to deduce
results paralleling those that have been established in the Gaussian setting. We present additional consequences,
such as an estimation theoretic representation of entropy, the relationship between causal and non-causal (matched
and mismatched) estimation in continuous time, and some of its implications. Section 6 illustrates some of our
key findings via a couple of simplistic examples where the underlying noise-free process is a DC signal. Section
7 is dedicated to proving our results. We end that section by indicating how some of the results carry over to
accommodate the presence of feedback. An alternative route to proving the main results via a more elementary
analysis is described in Section 8. We conclude in Section 9 with a summary of our findings and some related future
directions.

2 A Natural Loss Function 

Throughout the paper we use $\ell$ to denote the loss function specified in (1). In this section, we collect a couple
of lemmas suggesting that $\ell$ is a natural function for measuring goodness of estimation of a non-negative random
variable, and that it possesses some of the celebrated properties of squared error loss that make the latter so popular.

Lemma 2.1 The function $\ell$ defined in (1) has the following properties:

1. Non-negativity: $\ell(x, \hat{x}) \ge 0$ with equality if and only if $x = \hat{x}$.

2. Convexity: $\ell(x, \hat{x})$ is convex in each of its arguments.

3. Scaling: $\ell(\alpha x, \alpha \hat{x}) = \alpha \cdot \ell(x, \hat{x})$, $\alpha > 0$.

4. Unboundedness of loss for underestimation: For any $x > 0$, $\lim_{\hat{x} \to 0^+} \ell(x, \hat{x}) = \ell(x, 0) = \infty$.

One way of seeing why the first (non-negativity) property stated in the lemma holds is to recall that

$$\ell(x, \hat{x}) = \hat{x} \cdot \ell_0(x/\hat{x}) \tag{2}$$

and note that the function

$$\ell_0(x) = x \log x - x + 1 \tag{3}$$

is a (convex) non-negative function assuming its unique minimum value of 0 at $x = 1$, cf. Figure 1-(b). We leave
the elementary verification of the remaining properties collected in the lemma to the reader. Note that the first
two properties in the lemma are enough to imply unboundedness of the loss for overestimation, i.e., that for any
$x > 0$, $\lim_{\hat{x} \to \infty} \ell(x, \hat{x}) = \infty$. The fourth property shows that the same holds true for underestimation, a pleasing
property when measuring goodness of reconstruction of non-negative quantities, not possessed by the more common
loss functions such as absolute or squared error.
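As a concrete illustration of the properties in Lemma 2.1, the following minimal Python sketch evaluates the loss under the conventions stated after (1). The function names `loss` and `loss0` are illustrative choices and do not come from the paper; the treatment of the boundary cases is our reading of those conventions.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat on [0, inf) x [0, inf).

    Boundary cases follow the conventions in the text:
    l(0, xhat) = xhat (so l(0, 0) = 0) and l(x, 0) = inf for x > 0.
    """
    if x < 0 or xhat < 0:
        raise ValueError("both arguments must be non-negative")
    if x == 0:
        return float(xhat)
    if xhat == 0:
        return math.inf
    return x * math.log(x / xhat) - x + xhat

def loss0(u):
    """l_0(u) = u*log(u) - u + 1, with l_0(0) = 1 by the convention 0 log 0 = 0."""
    return 1.0 if u == 0 else u * math.log(u) - u + 1.0
```

For instance, `loss(1.0, 1e-9)` is much larger than `loss(1.0, 1e-3)`, reflecting the fourth property: underestimating a positive quantity by a value near zero is penalized without bound, which is not the case for squared error.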

To note another key property, recall the squared error loss function 

$$\ell_{SE}(x, \hat{x}) = (x - \hat{x})^2, \tag{4}$$

which satisfies, for any random variable X of finite variance,

$$E[\ell_{SE}(X, \hat{x})] = E[\ell_{SE}(X, EX)] + \ell_{SE}(EX, \hat{x}). \tag{5}$$

The same relationship holds under our present loss function as well:

Lemma 2.2 For any non-negative random variable X with $E[X \log^+ X] < \infty$, and any $\hat{x} \in [0, \infty)$,

$$E[\ell(X, \hat{x})] = E[\ell(X, EX)] + \ell(EX, \hat{x}). \tag{6}$$

Proof: When X = 0 a.s., EX = 0, and this identity follows from our conventions. Otherwise EX > 0 and

$$E[\ell(X, \hat{x})] = E\left[X \log\frac{X}{\hat{x}} - X + \hat{x}\right] \tag{7}$$

$$= E\left[X \log\frac{X}{EX} + X \log\frac{EX}{\hat{x}} - X + \hat{x}\right] \tag{8}$$

$$= E\left[X \log\frac{X}{EX} - X + EX\right] + EX \log\frac{EX}{\hat{x}} - EX + \hat{x} \tag{9}$$

$$= E[\ell(X, EX)] + \ell(EX, \hat{x}). \tag{10}$$

□

One immediate consequence of (6), when put together with the first (non-negativity) property in Lemma 2.1, is the
fact that EX uniquely minimizes $E[\ell(X, \hat{x})]$ over all $\hat{x}$:

$$\min_{\hat{x}} E[\ell(X, \hat{x})] = E[\ell(X, EX)] = E[X \log X] - E[X] \log E[X], \tag{11}$$



and thus $E[X \log X] - E[X] \log E[X]$ plays a role analogous to that played by variance under squared error loss.
An immediate consequence of (11) is that the conditional expectation $E[X|Y]$ is the unique estimator of X based on Y
minimizing the mean loss not only under $\ell_{SE}$, but also under $\ell$. This property is in fact common to (and characterizes)
the family of Bregman loss functions [1].
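The decomposition of Lemma 2.2 and the minimizing role of EX can be checked numerically for a finitely supported X. The sketch below is only an illustration under that assumption; the helper names `mean_loss` and `decomposition_gap` and the example distribution are ours, not the paper's.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat, with l(0, xhat) = xhat and l(x, 0) = inf for x > 0."""
    if x == 0:
        return float(xhat)
    if xhat == 0:
        return math.inf
    return x * math.log(x / xhat) - x + xhat

def mean_loss(values, probs, xhat):
    """E[l(X, xhat)] for X taking values[i] with probability probs[i]."""
    return sum(p * loss(x, xhat) for x, p in zip(values, probs))

def decomposition_gap(values, probs, xhat):
    """Left side minus right side of the identity in Lemma 2.2; should be ~0."""
    mean = sum(p * x for x, p in zip(values, probs))
    lhs = mean_loss(values, probs, xhat)
    rhs = mean_loss(values, probs, mean) + loss(mean, xhat)
    return lhs - rhs
```

Evaluating `decomposition_gap` at several candidate estimates returns values at the level of floating-point round-off, and `mean_loss` evaluated at the mean coincides with $E[X \log X] - E[X] \log E[X]$, the quantity playing the role of variance.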

Another key property shared by the loss function $\ell$ and squared error loss, exhibited respectively in (6) and (5),
is that, beyond quantifying the loss, it also quantifies the price of mismatch, i.e., the excess loss due to using $\hat{x}$ in lieu
of EX (cf., respectively, the second terms on the right hand sides of (6) and (5)). This property, which in the case
of squared error loss is due to orthogonality, is as key in the derivation of our results as the orthogonality principle
is for deriving the results in [13], [31], [33].



3 Notation and Conventions 



Our conventions and notation for information measures, such as mutual information and relative entropy, are
standard. The initiated reader is advised to skip this section. If U, V, W are three random variables taking values in
Polish spaces $\mathcal{U}$, $\mathcal{V}$, $\mathcal{W}$, respectively, and defined on a common probability space with a probability measure P, we let
$P_U$, $P_{U,V}$ etc. denote the probability measures induced by U, the pair (U, V), etc., while e.g., $P_{U|V}$ denotes a regular
version of the conditional distribution of U given V. $P_{U|V=v}$ is the distribution on $\mathcal{U}$ obtained by evaluating that regular
version at v. If Q is another probability measure on the same measurable space we similarly denote $Q_U$, $Q_{U|V}$, etc.
As usual, given two measures on the same measurable space, e.g., P and Q, define their relative entropy (divergence)
by

$$D(P\|Q) = \int \left[\log \frac{dP}{dQ}\right] dP \tag{12}$$

when P is absolutely continuous w.r.t. Q, defining $D(P\|Q) = \infty$ otherwise. An immediate consequence of the
definitions of relative entropy and of the Radon-Nikodym derivative is that if $f : \mathcal{U} \to \mathcal{V}$ is measurable and one-to-one,
and V = f(U), then

$$D(P_U\|Q_U) = D(P_V\|Q_V). \tag{13}$$

Following [5], we further use the notation

$$D(P_{U|V}\|Q_{U|V}|P_V) = \int D(P_{U|V=v}\|Q_{U|V=v})\, dP_V(v), \tag{14}$$

where on the right side $D(P_{U|V=v}\|Q_{U|V=v})$ is a divergence in the sense of (12) between the measures $P_{U|V=v}$ and $Q_{U|V=v}$. It
will be convenient to write

$$D(P_{U|V}\|Q_{U|V}) \tag{15}$$

to denote f(V) when $f(v) = D(P_{U|V=v}\|Q_{U|V=v})$. Thus $D(P_{U|V}\|Q_{U|V})$ is a random variable while $D(P_{U|V}\|Q_{U|V}|P_V)$ is
its expectation under P. With this notation, the chain rule for relative entropy (cf., e.g., [BJ Subsection D.3]) is

$$D(P_{U,V}\|Q_{U,V}) = D(P_U\|Q_U) + D(P_{V|U}\|Q_{V|U}|P_U) \tag{16}$$

and is valid regardless of the finiteness of both sides of the equation.
The mutual information between U and V is defined as 

$$I(U; V) = D(P_{U,V}\|P_U \times P_V), \tag{17}$$

where $P_U \times P_V$ denotes the product measure induced by $P_U$ and $P_V$. We note in passing, in line with the comment
on relative entropy and one-to-one transformations leading to (13), that if f and g are two measurable one-to-one
transformations and A = f(U) while B = g(V), then

$$I(U; V) = I(A; B). \tag{18}$$

Finally, the conditional mutual information between U and V, given W, is defined as

$$I(U; V|W) = D(P_{U,V|W}\|P_{U|W} \times P_{V|W}|P_W). \tag{19}$$

The roles of U, V, W will be played in what follows by scalar random variables, vectors, or processes. 
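For readers who prefer a computational view of the notation, the chain rule (16) can be checked on finite alphabets. The sketch below assumes joint distributions given as small matrices with entry `[u][v]`; the names `kl` and `chain_rule_sides` are hypothetical helpers introduced only for this illustration.

```python
import math

def kl(p, q):
    """D(P||Q) for finite pmfs given as equal-length lists; inf if P is not dominated by Q."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0.0:
            if b == 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total

def chain_rule_sides(P, Q):
    """Return (left side, right side) of the chain rule (16) for joint pmfs P[u][v], Q[u][v]."""
    lhs = kl([x for row in P for x in row], [x for row in Q for x in row])
    pu = [sum(row) for row in P]   # marginal of U under P
    qu = [sum(row) for row in Q]   # marginal of U under Q
    cond = 0.0                     # D(P_{V|U} || Q_{V|U} | P_U)
    for row_p, row_q, a, b in zip(P, Q, pu, qu):
        if a == 0.0:
            continue
        if b == 0.0:
            cond = math.inf
            break
        cond += a * kl([x / a for x in row_p], [x / b for x in row_q])
    return lhs, kl(pu, qu) + cond
```

Both returned values agree up to round-off for any pair of joint distributions on the same finite product alphabet, including the case where both sides are infinite.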

4 Relative Entropy and Mismatched Estimation 
4.1 Random Variables 

Suppose that X is a non-negative random variable and the conditional law of a r.v. $Y_\gamma$, given X, is Poisson($\gamma X$). If
X ~ P, denote expectation w.r.t. the corresponding joint law of X and $Y_\gamma$ by $E_P$, the distribution of $Y_\gamma$ by $P_{Y_\gamma}$,
the conditional expectation by $E_P[X|Y_\gamma]$, etc. We denote the mutual information by $I_P(X; Y_\gamma)$ or simply $I(X; Y_\gamma)$
when there is no ambiguity. Let further $\mathrm{mle}_{P,Q}(\gamma)$ denote the mean loss under $\ell$ in estimating X based on $Y_\gamma$ using
the estimator that would have been optimal had X ~ Q when in fact X ~ P, i.e.,

$$\mathrm{mle}_{P,Q}(\gamma) = E_P\left[\ell(X, E_Q[X|Y_\gamma])\right]. \tag{20}$$

The following is a new representation of relative entropy, paralleling the Gaussian channel result of [31]:

Theorem 4.1 For any pair P, Q of probability measures over [a, b], where $0 < a < b < \infty$,

$$D(P\|Q) = \int_0^\infty \left[\mathrm{mle}_{P,Q}(\gamma) - \mathrm{mle}_{P,P}(\gamma)\right] d\gamma. \tag{21}$$

Theorem 4.1 is a direct consequence of the fact (proved in Section 7) that

$$\lim_{\gamma \to \infty} D(P_{Y_\gamma}\|Q_{Y_\gamma}) = D(P\|Q), \tag{22}$$

combined with the following result, which is the Poisson parallel of [31, Equation (24)]:

Theorem 4.2 For any P, Q as in Theorem 4.1 and for any $\gamma > 0$,

$$D(P_{Y_\gamma}\|Q_{Y_\gamma}) = \int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha. \tag{23}$$

To note one immediate implication of Theorem 4.2, the non-negativity of the integrand on the right hand side of (23),
as follows from (6), implies that $D(P_{Y_\gamma}\|Q_{Y_\gamma})$ increases with $\gamma$. Additional implications are pointed out in Section 5.
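Theorem 4.2 lends itself to a direct numerical check when P and Q are supported on a common finite set of positive points. The sketch below is an illustration under that assumption, not the paper's proof technique: the output alphabet is truncated at `y_max`, the integral is approximated with a composite Simpson rule, and all function names and the example priors are our own choices.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat, with l(0, xhat) = xhat and l(x, 0) = inf for x > 0."""
    if x == 0:
        return float(xhat)
    if xhat == 0:
        return math.inf
    return x * math.log(x / xhat) - x + xhat

def poisson_pmf(lam, y_max):
    """[P(Y = 0), ..., P(Y = y_max)] for Y ~ Poisson(lam), via p(y) = p(y-1) * lam / y."""
    pmf = [math.exp(-lam)]
    for y in range(1, y_max + 1):
        pmf.append(pmf[-1] * lam / y)
    return pmf

def mle(xs, p, q, gamma, y_max=80):
    """mle_{P,Q}(gamma) = E_P[l(X, E_Q[X | Y_gamma])]; assumes q > 0 on every point of xs."""
    pmfs = [poisson_pmf(gamma * x, y_max) for x in xs]
    total = 0.0
    for y in range(y_max + 1):
        den = sum(qi * pmf[y] for qi, pmf in zip(q, pmfs))
        if den == 0.0:  # output value unreachable (or underflowed): no contribution
            continue
        est = sum(qi * x * pmf[y] for qi, x, pmf in zip(q, xs, pmfs)) / den
        total += sum(pi * pmf[y] * loss(x, est) for pi, x, pmf in zip(p, xs, pmfs))
    return total

def output_divergence(xs, p, q, gamma, y_max=80):
    """D(P_{Y_gamma} || Q_{Y_gamma}) on the truncated output alphabet {0, ..., y_max}."""
    pmfs = [poisson_pmf(gamma * x, y_max) for x in xs]
    total = 0.0
    for y in range(y_max + 1):
        py = sum(pi * pmf[y] for pi, pmf in zip(p, pmfs))
        qy = sum(qi * pmf[y] for qi, pmf in zip(q, pmfs))
        if py > 0.0 and qy > 0.0:
            total += py * math.log(py / qy)
    return total

def simpson(f, a, b, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * h)
    return s * h / 3

def excess_loss_integral(xs, p, q, gamma):
    """Right hand side of (23): integral over [0, gamma] of the excess mean loss."""
    return simpson(lambda a: mle(xs, p, q, a) - mle(xs, p, p, a), 0.0, gamma)
```

With, say, `xs = [1.0, 3.0]`, `p = [0.5, 0.5]` and `q = [0.2, 0.8]`, the values of `output_divergence(xs, p, q, 2.0)` and `excess_loss_integral(xs, p, q, 2.0)` should agree up to the quadrature and truncation error, and the divergence should grow with $\gamma$.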






4.2 Continuous-Time Stochastic Processes 



Fix $T > 0$. Denote by $D$ the space of right-continuous paths with left limits from $[0, T]$ to $\mathbb{R}$. Endow $D$ with the
usual Skorohod topology [3] and denote by $\mathcal{D}$ the Borel $\sigma$-algebra of $D$. Denote by $\mathcal{P}$ the collection of probability
measures $P$ on $(D, \mathcal{D})$ under which for $P$-a.e. $\xi \in D$, $\xi$ is bounded between two positive constants.

A measurable space $(\Omega, \mathcal{F})$ is given, on which a stochastic process $X^T = \{X_t, 0 \le t \le T\}$ and, for each $\gamma > 0$,
a stochastic process $Y_\gamma^T = \{Y_{\gamma,t}, 0 \le t \le T\}$, are given. These processes represent the signal and observation,
respectively. The sample paths of each of them are in $D$, and each is assumed to be measurable as a map from
$(\Omega \times [0, T], \mathcal{F} \otimes \mathcal{B}([0, T]))$ to $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$ (where throughout, $\mathcal{B}(U)$ denotes the Borel $\sigma$-algebra of a metric space
$U$). Given $\gamma > 0$, we are interested in probability measures $\mathbf{P}$ on $(\Omega, \mathcal{F})$ under which the measure induced by $X^T$
on $(D, \mathcal{D})$ is in $\mathcal{P}$, and $Y_\gamma^T$ is jointly distributed with $X^T$ in such a way that, given $X^T$, $Y_\gamma^T$ is a non-homogeneous
Poisson process with intensity function $\gamma \cdot X^T$. We denote by $\mathcal{P}_\gamma = \mathcal{P}_\gamma(\Omega, \mathcal{F})$ the collection of all such measures.

Given $\gamma > 0$ and a probability measure $P$ on $(D, \mathcal{D})$, pick $\mathbf{P} \in \mathcal{P}_\gamma$ for which the law of $X^T$ on $D$ is $P$, and let $E_P$
denote expectation w.r.t. the joint law of $X^T$ and $Y_\gamma^T$ on $D^2$ under $\mathbf{P}$. Although this law depends on $\gamma$, there is no
need to indicate $\gamma$ in the notation $E_P$ because $Y_\gamma^T$ itself depends on $\gamma$. Let $\mathrm{mle}_{P,Q}(\gamma)$ denote the expected cumulative
loss in non-causal filtering of $X^T$ based on $Y_\gamma^T$ when $X^T \sim P$ but the filter used is optimized for $X^T \sim Q$, i.e.,

$$\mathrm{mle}_{P,Q}(\gamma) = E_P\left[\int_0^T \ell(X_t, E_Q[X_t|Y_\gamma^T])\, dt\right]. \tag{24}$$

Note that the definitions in (20) and (24) are consistent with each other, and which of them applies is dictated
unambiguously according to whether P and Q govern random variables or processes. Theorem 4.2 carries over to
stochastic processes as follows:

Theorem 4.3 Let P and Q be two probability measures that are members of $\mathcal{P}$. For $\gamma > 0$,

$$D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = \int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha. \tag{25}$$

Let now $\mathrm{cmle}_{P,Q}(\gamma)$ denote the expected cumulative loss in causal filtering of $X^T$ based on $Y_\gamma^T$ when $X^T \sim P$
but the filter used is optimized for $X^T \sim Q$, i.e.,

$$\mathrm{cmle}_{P,Q}(\gamma) = E_P\left[\int_0^T \ell(X_t, E_Q[X_t|Y_\gamma^t])\, dt\right]. \tag{26}$$

The right hand side of (24) and that of (26) differ only in that the conditional expectation appearing in the former
has the entire process $Y_\gamma^T$ in the conditioning, while in the latter only $Y_\gamma^t$, the process up to time t. Our main result
regarding $\mathrm{cmle}_{P,Q}(\gamma)$ is that its excess value above the mean filtering loss of the optimal causal filter is proportional
to the relative entropy between $P_{Y_\gamma^T}$ and $Q_{Y_\gamma^T}$:

Theorem 4.4 Let P and Q be two probability measures that are members of $\mathcal{P}$. For $\gamma > 0$,

$$D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = \gamma \cdot \left[\mathrm{cmle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,P}(\gamma)\right]. \tag{27}$$

Put together, Theorem 4.3 and Theorem 4.4 yield, for $\gamma > 0$,

$$\mathrm{cmle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,P}(\gamma) = \frac{1}{\gamma}\int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha = \frac{1}{\gamma} D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}), \tag{28}$$

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) are
well-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 7.



5 Implications 

5.1 Mutual Information and Minimum Mean Estimation Loss 

Let X be a non-negative random variable and, for $\gamma > 0$, let $Y_\gamma$ be a non-negative integer-valued random variable,
jointly distributed with X such that the conditional law of $Y_\gamma$ given X is Poisson($\gamma X$). When specialized to this
setting, Theorem 2 of [14] gives

$$\frac{d}{d\gamma} I(X; Y_\gamma) = E\left[X \log X - E[X|Y_\gamma] \log E[X|Y_\gamma]\right]. \tag{29}$$

It is instructive to observe that the right hand side of (29) is nothing but the minimum mean loss in estimating X
based on $Y_\gamma$, under the loss function $\ell$. Indeed, denoting this minimum mean loss by $\mathrm{mmle}(\gamma)$, i.e.,

$$\mathrm{mmle}(\gamma) = E\left[\ell(X, E[X|Y_\gamma])\right], \tag{30}$$

we have

$$E\left[\ell(X, E[X|Y_\gamma])\right] = E\left[X \log\frac{X}{E[X|Y_\gamma]} - X + E[X|Y_\gamma]\right] \tag{31}$$

$$= E\left[X \log X - X \log E[X|Y_\gamma]\right] \tag{32}$$

$$= E\left[X \log X - E[X|Y_\gamma] \log E[X|Y_\gamma]\right]. \tag{33}$$

Thus, (29) can be stated as the "I-MMLE" relationship

$$\frac{d}{d\gamma} I(X; Y_\gamma) = \mathrm{mmle}(\gamma), \tag{34}$$

in complete analogy with the I-MMSE relationship of [13]. To see one immediate benefit of this realization that the
right hand side of (29) coincides with the minimum mean loss in the right hand side of (34), we first go through
the following data processing argument: Fix $\gamma' < \gamma$, let $\{B_i\}_{i \ge 1}$ be i.i.d. Bernoulli($\gamma'/\gamma$) independent of $(X, Y_\gamma)$,
and note that $(X, \sum_{i=1}^{Y_\gamma} B_i)$ is equal in distribution to $(X, Y_{\gamma'})$. Since estimating X based on $\sum_{i=1}^{Y_\gamma} B_i$, which is a
function of $Y_\gamma$ and the randomization sequence $\{B_i\}$, cannot be better (in the sense of minimizing the expected loss
under $\ell$) than estimating X based on $Y_\gamma$, we have $\mathrm{mmle}(\gamma') \ge \mathrm{mmle}(\gamma)$. Thus, $\mathrm{mmle}(\gamma)$ is non-increasing with $\gamma$
which, when combined with (34), yields the following analogue of [12, Corollary 1]:

Corollary 5.1 $I(X; Y_\gamma)$ is concave in $\gamma$.
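The I-MMLE relationship (34) and the concavity in Corollary 5.1 can be probed numerically for a finitely supported input. The sketch below compares a central finite difference of the mutual information with the minimum mean loss; the truncation level `y_max`, the step size used in the comparison and the helper names are assumptions made for this illustration.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat, with l(0, xhat) = xhat and l(x, 0) = inf for x > 0."""
    if x == 0:
        return float(xhat)
    if xhat == 0:
        return math.inf
    return x * math.log(x / xhat) - x + xhat

def poisson_pmf(lam, y_max):
    """[P(Y = 0), ..., P(Y = y_max)] for Y ~ Poisson(lam)."""
    pmf = [math.exp(-lam)]
    for y in range(1, y_max + 1):
        pmf.append(pmf[-1] * lam / y)
    return pmf

def mutual_information(xs, p, gamma, y_max=100):
    """I(X; Y_gamma) in nats for X on the points xs with probabilities p."""
    pmfs = [poisson_pmf(gamma * x, y_max) for x in xs]
    total = 0.0
    for y in range(y_max + 1):
        py = sum(pi * pmf[y] for pi, pmf in zip(p, pmfs))
        if py == 0.0:
            continue
        for pi, pmf in zip(p, pmfs):
            if pmf[y] > 0.0:
                total += pi * pmf[y] * math.log(pmf[y] / py)
    return total

def mmle(xs, p, gamma, y_max=100):
    """mmle(gamma) = E[l(X, E[X | Y_gamma])] under the true prior p."""
    pmfs = [poisson_pmf(gamma * x, y_max) for x in xs]
    total = 0.0
    for y in range(y_max + 1):
        py = sum(pi * pmf[y] for pi, pmf in zip(p, pmfs))
        if py == 0.0:
            continue
        est = sum(pi * x * pmf[y] for pi, x, pmf in zip(p, xs, pmfs)) / py
        total += sum(pi * pmf[y] * loss(x, est) for pi, x, pmf in zip(p, xs, pmfs))
    return total

def mi_derivative(xs, p, gamma, h=1e-4):
    """Central finite-difference approximation of dI/dgamma."""
    return (mutual_information(xs, p, gamma + h) - mutual_information(xs, p, gamma - h)) / (2 * h)
```

For an input such as `xs = [0.5, 2.0, 4.0]` with `p = [0.3, 0.5, 0.2]`, `mi_derivative` and `mmle` should agree closely at any fixed $\gamma$, and `mmle` should decrease as $\gamma$ grows, consistent with the thinning argument above.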

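It is also worth pointing out that the I-MMLE relationship can be viewed as a direct consequence of Theorem
4.2. Indeed, in the notation of Section 4.1, (34) is expressed as

$$\frac{d}{d\gamma} I_P(X; Y_\gamma) = \mathrm{mle}_{P,P}(\gamma), \tag{35}$$

which can be inferred from:

$$I_P(X; Y_\gamma) = \int D(P_{Y_\gamma|X=x}\|P_{Y_\gamma})\, dP_X(x) \tag{36}$$

$$\stackrel{(a)}{=} \int \int_0^\gamma \left[\mathrm{mle}_{\delta_x,P}(\alpha) - \mathrm{mle}_{\delta_x,\delta_x}(\alpha)\right] d\alpha\, dP_X(x) \tag{37}$$

$$= \int \int_0^\gamma E_{\delta_x}\left[\ell(x, E_P[X|Y_\alpha])\right] d\alpha\, dP_X(x) \tag{38}$$

$$= \int_0^\gamma E_P\left[\ell(X, E_P[X|Y_\alpha])\right] d\alpha \tag{39}$$

$$= \int_0^\gamma \mathrm{mle}_{P,P}(\alpha)\, d\alpha, \tag{40}$$

where we use $\delta_x$ to denote the degenerate distribution on x and (a) follows by applying Theorem 4.2 on the integrand
in (36). Note that (40) is nothing but the integral version of (35).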

A similar exercise - of expressing the mutual information between the channel input and output as a relative
entropy between the distribution of the output conditioned on a particular channel input and the unconditioned
channel output distribution, integrated over the channel input distribution, and using the relevant relationship
from Section 4 to express the integrand - can be performed in the continuous-time setting of Section 4.2. Indeed,
application of Theorem 4.3 on the said integrand gives

$$I_P(X^T; Y_\gamma^T) = \int_0^\gamma \mathrm{mle}_{P,P}(\alpha)\, d\alpha. \tag{41}$$

The relationship (41) is consistent with [14, Theorem 4], the two in fact coinciding when specializing the latter to
zero dark current. In a similar way, application of Theorem 4.4 gives

$$I_P(X^T; Y_\gamma^T) = \gamma \cdot \mathrm{cmle}_{P,P}(\gamma), \tag{42}$$

which, as in (33), is seen to be equivalent to the known relationship

$$I_P(X^T; Y_\gamma^T) = \gamma \cdot E_P\left[\int_0^T \left(X_t \log X_t - E_P[X_t|Y_\gamma^t] \log E_P[X_t|Y_\gamma^t]\right) dt\right], \tag{43}$$

cf., e.g., [20, Subsection 19.5]. The representation (42) highlights the optimal estimation interpretation of this
relationship and, through that, the close analogy with Duncan's theorem [7, Theorem 3].

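Finally, going back to the setting of the "I-MMLE" relationship for scalar random variables, we note that the
conditional entropy of the output given the input is given by

$$H(Y_\gamma|X) = E\left[\gamma X \left(1 - \log(\gamma X)\right) + e^{-\gamma X} \sum_{k=0}^{\infty} \frac{(\gamma X)^k \log k!}{k!}\right] \tag{44}$$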



so, in particular, is dependent on both $\gamma$ and the distribution of X, in contrast to the Gaussian channel setting
of [13] where the (differential) entropy of the output given the input depends on neither. Thus, while the mutual
information can be replaced by the (differential) entropy of the channel output in the I-MMSE relationship of [13],
this is not the case for the I-MMLE relationship of our present setting, which further consolidates the role of mutual
information rather than channel output entropy as key in the relations between information and estimation.



5.2 Estimation Theoretic Representation of Entropy 

For a random variable X taking values in a countable set $\mathcal{A}$, and a mapping $g : \mathcal{A} \to [0, \infty)$, let $Z_\gamma$ be jointly
distributed with X such that the conditional law of $Z_\gamma$ given X is Poisson($\gamma \cdot g(X)$). With benign abuse of notation,
we write Poisson($\gamma \cdot g(X)$) in lieu of $Z_\gamma$ when convenient. The following is a direct consequence of Theorem 4.1.

Corollary 5.2 For any discrete random variable X ~ P supported on the alphabet $\mathcal{A}$, for any one-to-one mapping
$g : \mathcal{A} \to [0, \infty)$, and for any $x \in \mathcal{A}$ with $P(X = x) > 0$, denoting expectation w.r.t. P by E,

$$\log \frac{1}{P(X = x)} = \int_0^\infty E\left[\ell\left(g(X), E[g(X)|\mathrm{Poisson}(\gamma \cdot g(X))]\right) \,\middle|\, X = x\right] d\gamma. \tag{45}$$

Proof: Apply Theorem 4.1 with $P = \delta_{g(x)}$ and an arbitrary Q supported on $g(\mathcal{A})$ to obtain

$$\log \frac{1}{Q(g(X) = g(x))} = \int_0^\infty E\left[\ell\left(g(x), E_Q[g(X)|\mathrm{Poisson}(\gamma \cdot g(X))]\right) \,\middle|\, X = x\right] d\gamma, \tag{46}$$

and note that (45) is nothing but (46) specialized to Q = P. □

Averaging over (45) with respect to P gives the following analogue to Theorem 13 of [13]:

Corollary 5.3 For any discrete random variable X taking values in $\mathcal{A}$, and for any one-to-one mapping
$g : \mathcal{A} \to [0, \infty)$,

$$H(X) = \int_0^\infty E\left[\ell\left(g(X), E[g(X)|\mathrm{Poisson}(\gamma \cdot g(X))]\right)\right] d\gamma. \tag{47}$$

Corollary 5.3 can also be deduced directly from (40) by noting that, for discrete X, $I(X; Y_\gamma) \to H(X)$ as $\gamma \to \infty$, and
that if the discrete X takes values in the alphabet $\mathcal{A}$ then $H(X) = H(g(X))$ for any one-to-one mapping $g : \mathcal{A} \to \mathbb{R}$.
It is interesting that the integrals on the right-hand sides of (45) and (47) do not depend on g, a fact that seems
hard to deduce directly from estimation theoretic considerations.
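The representation (47) can be illustrated numerically for a binary X. The sketch below truncates the integral at a finite `gamma_max` and the output alphabet at `y_max`, both of which are assumptions of this illustration (the integrand decays quickly when the values of g are well separated). The helper names and the example values of g are ours, and the values of g are kept at or above 1 purely for numerical convenience.

```python
import math

def loss(x, xhat):
    """l(x, xhat) = x*log(x/xhat) - x + xhat, with l(0, xhat) = xhat and l(x, 0) = inf for x > 0."""
    if x == 0:
        return float(xhat)
    if xhat == 0:
        return math.inf
    return x * math.log(x / xhat) - x + xhat

def poisson_pmf(lam, y_max):
    """[P(Y = 0), ..., P(Y = y_max)] for Y ~ Poisson(lam)."""
    pmf = [math.exp(-lam)]
    for y in range(1, y_max + 1):
        pmf.append(pmf[-1] * lam / y)
    return pmf

def mmle(gs, p, gamma, y_max):
    """E[l(g(X), E[g(X) | Poisson(gamma * g(X))])] for g(X) on the points gs with probabilities p."""
    pmfs = [poisson_pmf(gamma * g, y_max) for g in gs]
    total = 0.0
    for y in range(y_max + 1):
        py = sum(pi * pmf[y] for pi, pmf in zip(p, pmfs))
        if py == 0.0:
            continue
        est = sum(pi * g * pmf[y] for pi, g, pmf in zip(p, gs, pmfs)) / py
        total += sum(pi * pmf[y] * loss(g, est) for pi, g, pmf in zip(p, gs, pmfs))
    return total

def entropy_integral(gs, p, gamma_max=40.0, n=400, y_max=300):
    """Simpson approximation of the right hand side of (47), truncated at gamma_max."""
    h = gamma_max / n
    s = mmle(gs, p, 0.0, y_max) + mmle(gs, p, gamma_max, y_max)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * mmle(gs, p, k * h, y_max)
    return s * h / 3

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
```

For `p = [0.3, 0.7]`, calling `entropy_integral([1.0, 4.0], p)` and `entropy_integral([2.0, 8.0], p, gamma_max=20.0)` should both come out close to `entropy(p)`, illustrating that the integral does not depend on the choice of the one-to-one mapping g.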
5.3 Relationship Between Causal and Non-Causal Estimation and the
Price of Mismatch in Continuous Time

Our main findings for the continuous-time setting are summarized in the theorem that follows, relating the causal
estimation error, the non-causal estimation error, the mutual information, and the relative entropy. It follows directly
from combining (28), (41) and (42).

Theorem 5.5 Let P and Q be two probability laws on the non-negative process $X^T$ that are members of $\mathcal{P}$. For any
$\gamma > 0$,

$$\mathrm{cmle}_{P,Q}(\gamma) = \frac{1}{\gamma}\int_0^\gamma \mathrm{mle}_{P,Q}(\alpha)\, d\alpha = \frac{1}{\gamma}\left[I_P(X^T; Y_\gamma^T) + D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T})\right]. \tag{48}$$

We list a few observations that are implied for the continuous-time setting by Theorem 5.5 in a manner similar to
that in which the observations in Section 2 of [33] are implied by the main result therein. We refer to that section
for the details.

• $\mathrm{cmle}_{P,Q}(\gamma)$ and $\mathrm{mle}_{P,Q}(\gamma)$ do not necessarily decrease with $\gamma$: In the mismatched case, when $P \ne Q$, $\mathrm{cmle}_{P,Q}(\gamma)$
and $\mathrm{mle}_{P,Q}(\gamma)$ need not decrease with $\gamma$, nor need the relationship $\mathrm{cmle}_{P,Q}(\gamma) > \mathrm{mle}_{P,Q}(\gamma)$ hold. The two
properties in fact determine each other:

$$\mathrm{mle}_{P,Q}(\gamma) > \mathrm{cmle}_{P,Q}(\gamma) \quad \text{if and only if} \quad \frac{d}{d\gamma}\mathrm{cmle}_{P,Q}(\gamma) > 0. \tag{49}$$

Indeed, differentiating the first equality in (48) gives $\frac{d}{d\gamma}\mathrm{cmle}_{P,Q}(\gamma) = [\mathrm{mle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,Q}(\gamma)]/\gamma$.
In words: an increase in SNR deteriorates the mismatched causal estimation performance if and only if the
latter is better than the non-causal estimation performance. We give an extreme example of such a scenario in
Section 6.3.



• Invariance of the Mismatched Filtering Performance to the Direction of Time: Let $\mathrm{acmle}_{P,Q}(\gamma)$ denote the
mean estimation loss achieved by the anti-causal mismatched filter, i.e.,

$$\mathrm{acmle}_{P,Q}(\gamma) = E_P\left[\int_0^T \ell\left(X_t, E_Q[X_t|\{Y_{\gamma,s} - Y_{\gamma,t}\}_{t \le s \le T}]\right) dt\right]. \tag{50}$$

By the invariance of the mutual information and of the relative entropy in (48) to the direction of the flow of
time (apply, respectively, (18) and (13) with the role of the transformations f and g played by time reversal),
we obtain

$$\mathrm{acmle}_{P,Q}(\gamma) = \mathrm{cmle}_{P,Q}(\gamma). \tag{51}$$

It is remarkable that (51) holds under essentially no restrictions on P, on Q, or on the relationship between
them.

• Factor of 2 Relationship at Low SNR: Assuming that at $\gamma = 0$ $\mathrm{cmle}_{P,Q}(\gamma)$ is continuously differentiable and
non-zero, we have

$$\lim_{\gamma \to 0} \frac{\int_0^T E_P[\ell(X_t, E_Q X_t)]\, dt - \mathrm{mle}_{P,Q}(\gamma)}{\int_0^T E_P[\ell(X_t, E_Q X_t)]\, dt - \mathrm{cmle}_{P,Q}(\gamma)} = 2, \tag{52}$$

i.e., the non-causal error approaches its low SNR limit twice as rapidly as the causal error.

• High SNR Behavior of $D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T})$: Assume the relationship between P and Q is sufficiently regular to imply

$$\lim_{\gamma \to \infty} \mathrm{cmle}_{P,Q}(\gamma) = \lim_{\gamma \to \infty} \mathrm{mle}_{P,Q}(\gamma) = 0. \tag{53}$$

$D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T})$ exhibits one of the following possible behaviors in the high SNR regime:

1. $D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = 0$ for all $\gamma > 0$. This can happen if and only if $D(P\|Q) = 0$, i.e., the non-mismatched
setting.

2. $D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = O(1)$, which can happen if and only if $0 < D(P\|Q) < \infty$.

3. $\lim_{\gamma \to \infty} D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = \infty$ but $D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) = o(\gamma)$, which can happen if and only if $D(P\|Q) = \infty$.
I.e., when $D(P\|Q) = \infty$, $D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T})$ increases without bound with increasing SNR, but sub-linearly.

• "Semi-Stochastic" Setting: Suppose that $X_t = \lambda_t$, $0 \le t \le T$, $\lambda_t > 0$ being a deterministic signal. Applying
Theorem 5.5 with P degenerate on $\lambda^T$ gives

$$E\left[\int_0^T \ell(\lambda_t, E_Q[X_t|Y_\gamma^t])\, dt\right] = \frac{1}{\gamma}\int_0^\gamma E\left[\int_0^T \ell(\lambda_t, E_Q[X_t|Y_\alpha^T])\, dt\right] d\alpha = \frac{1}{\gamma} D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}), \tag{54}$$

where here $P_{Y_\gamma^T}$ is the law of a non-homogeneous Poisson process with intensity function $\gamma \cdot \lambda^T$, and the
expectations are with respect to this measure. The relationship (54) can be thought of as the non-Bayesian
version of (48).

We refer to Section 2 of [33] for the details leading to the above observations, as well as additional observations
and results that, equipped with Theorem 5.5, carry over verbatim from the Gaussian to the Poisson channel, such
as the structure and performance of minimax causal estimators and their direct relation to minimax source coding
via redundancy-capacity theory [TOl US [55] . 



6 Example: A DC Signal 

We now work out two examples. In both, the underlying noise-free process is a DC signal known to be such by the 
mismatched filter, the mismatch being only in the prior distribution on its amplitude. As simplistic as this scenario 
may be, it illustrates how some of the key observations made above play themselves out in concrete cases. 



6.1 Binary DC Signal 

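Consider the case where $X^T$ is a binary DC process, i.e., $X_t \equiv X$, where $X$ takes the values 1 and 0 with probabilities $p$ and $1 - p$, and without loss of generality take $T = 1$. Suppose that the mismatched filter is designed knowing that $X^T$ is a binary DC process, but under the assumption that $X$ takes the values 1 and 0 with probabilities $q$ and $1 - q$. Subscripting with $p$ and $q$ to denote the respective measures, we have

$$E_p[X_t | Y_\gamma^t = y] = \begin{cases} 1 & \text{if } y \ge 1 \\ \dfrac{p e^{-\gamma t}}{1 - p + p e^{-\gamma t}} & \text{if } y = 0 \end{cases} \tag{55}$$

and

$$E_p[\ell(X_t, E_q[X_t | Y_\gamma^t])] = g(p, q, \gamma t), \tag{56}$$

where

$$g(p, q, \gamma) = \frac{q e^{-\gamma}(1 - p)}{1 - q + q e^{-\gamma}} + p e^{-\gamma}\left[\log\frac{1 - q + q e^{-\gamma}}{q e^{-\gamma}} - \frac{1 - q}{1 - q + q e^{-\gamma}}\right]. \tag{57}$$

Thus

$$\mathrm{cmle}_{P,Q}(\gamma) = \int_0^1 g(p, q, \gamma t)\,dt = \frac{1}{\gamma}\left\{-p e^{-\gamma}\log\frac{1 - q + q e^{-\gamma}}{q e^{-\gamma}} + p\log\frac{1}{q} + (p - 1)\log\left[1 - q + q e^{-\gamma}\right]\right\}, \tag{58}$$

while

$$\mathrm{mle}_{P,Q}(\gamma) = g(p, q, \gamma). \tag{59}$$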
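The binary DC example lends itself to a quick numerical sanity check. The sketch below is illustrative only and not part of the original development: it assumes the reading of $g$ given above, integrates it with a plain midpoint rule (the step count is an arbitrary choice), and compares $\gamma\,[\mathrm{cmle}_{P,Q}(\gamma) - \mathrm{cmle}_{P,P}(\gamma)]$ with the relative entropy between the output laws, computed from the total count on $[0,1]$ (which is a sufficient statistic when the signal is DC). It also evaluates the low SNR ratio of (52) at a small SNR value.

```python
import math

def g(p, q, a):
    """Expected loss g(p, q, a): true prior p, assumed prior q, SNR level a."""
    e = math.exp(-a)
    d = 1.0 - q + q * e  # probability of zero counts under the assumed prior q
    return q * e * (1.0 - p) / d + p * e * (math.log(d / (q * e)) - (1.0 - q) / d)

def cmle(p, q, gam, n=20000):
    """Causal loss: midpoint-rule integral of g(p, q, gam * t) over t in [0, 1]."""
    h = 1.0 / n
    return h * sum(g(p, q, gam * (i + 0.5) * h) for i in range(n))

def output_divergence(p, q, gam):
    """D(P_Y || Q_Y) for the total count Y on [0, 1] of the binary DC example."""
    e = math.exp(-gam)
    p0, q0 = 1.0 - p + p * e, 1.0 - q + q * e  # P(Y = 0) under each prior
    return p0 * math.log(p0 / q0) + p * (1.0 - e) * math.log(p / q)

p, q, gam = 0.5, 0.2, 3.0
excess_area = gam * (cmle(p, q, gam) - cmle(p, p, gam))  # gamma times excess causal loss
divergence = output_divergence(p, q, gam)

small = 1e-3  # a small SNR at which to probe the factor-of-2 relationship
ratio = (g(p, q, 0.0) - g(p, q, small)) / (g(p, q, 0.0) - cmle(p, q, small))
print(excess_area, divergence, ratio)
```

Under these assumptions the first two printed numbers should agree up to integration error, and the third should be close to 2.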



The curves $\mathrm{cmle}_{P,Q}(\gamma)$ and $\mathrm{mle}_{P,Q}(\gamma)$ are plotted in Figure 2, along with those for the exact setting $\mathrm{cmle}_{P,P}(\gamma)$ and $\mathrm{mle}_{P,P}(\gamma)$. Theorem 5.5 implies that the area of the dark rectangle is equal to the area under the curve of $\mathrm{mle}_{P,P}(\gamma)$, which are both equal to the mutual information between the clean and noisy signal. It further implies that the sum of the areas of the dark and the light colored rectangles is equal to the area under the curve $\mathrm{mle}_{P,Q}(\gamma)$, and both are equal to the mutual information plus the relative entropy between the true and the mismatched channel output distribution. Thus, the area of the light colored rectangle is equal to the relative entropy. Also evident in the figure is the 'factor of 2' relationship of Equation (52), which holds for the two pairs of curves (mismatched and exact).

6.2 Support of P Not in Support of Q

In the previous example, $P$ and $Q$ had the same support, which guarantees $\lim_{\gamma\to\infty}\mathrm{mle}_{P,Q}(\gamma) = \lim_{\gamma\to\infty}\mathrm{cmle}_{P,Q}(\gamma) = 0$ despite the mismatch. Also, in that example $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$ were monotonically decreasing with $\gamma$. In general, neither of these properties need hold, as was mentioned in Section 5.3 and is illustrated in the following extreme example.

Suppose that, under $P$, the signal is deterministic and constant at 1/2, i.e., $X_t \equiv 1/2$ for all $0 \le t \le T = 1$. Under $Q$, $X^T$ is a binary DC process as in the previous example with $q = 1/2$. Using (55), an elementary calculation yields

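$$E_P[\ell(X_t, E_Q[X_t|Y_\gamma^t])] = f(\gamma t), \tag{60}$$

where

$$f(\gamma) \triangleq \ell(1/2, 1)\cdot\left(1 - e^{-\gamma/2}\right) + \ell\left(1/2, \frac{e^{-\gamma}}{1 + e^{-\gamma}}\right)\cdot e^{-\gamma/2}. \tag{61}$$

Thus

$$\mathrm{mle}_{P,Q}(\gamma) = f(\gamma) \tag{62}$$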

while elementary manipulations give 



cmlep :Q (7) 



f{lt)dt 



24 + 24e~ 7/2 + 37 2 + tt 2 - 24 ■ Gudermannian (-^j + 7 121og2 - 24 log 2 



1 

'247" 

-e~ 7/2 241og2 + 12 7 log [1 



127 log 



; 7 Cosh 



12 • Li 2 (-e 7 ) 



(63) 
(64) 
(65) 



where $\mathrm{Gudermannian}(x) \triangleq 2\arctan(e^x) - \pi/2$ and $\mathrm{Li}_2(x) \triangleq \sum_{k=1}^{\infty} x^k/k^2$. Figure 3 displays the curves of $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$ in (62) and (65). In accord with Theorem 5.5, the area of the rectangle is equal to the area under the curve of $\mathrm{mle}_{P,Q}(\gamma)$, which are both equal to the relative entropy between the true and the mismatched channel output distribution, noting that the mutual information in this case is zero since the clean signal is deterministic.

This example shows that $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$ need not vanish with increasing $\gamma$, as would be the case in the absence of mismatch. In fact, in this extreme example, $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$ not only are not vanishing with increasing $\gamma$ but are increasing without bound because the mismatched distribution has positive mass at 0 and therefore the conditional expectation under it is very close to zero when the channel output has zero occurrences, while the underlying signal value is 1/2. In this case the incurred loss is very large as $\ell(1/2, x)$ is unbounded in a neighborhood of 0. So large, in fact, as to cause the overall expected loss to grow without bound with $\gamma$, despite the diminishing probability of observing zero occurrences at the channel output.



7 Proofs via Change-of-Measure Formulae 



As noted in Section 4, Theorem 4.1 is immediate from Theorem 4.2 once (22) is established. Towards this end note first that

\begin{align}
D(P_{Y_\gamma}\|Q_{Y_\gamma}) &\le D(P_{X,Y_\gamma}\|Q_{X,Y_\gamma}) \tag{66}\\
&= D(P\|Q) + D(P_{Y_\gamma|X}\|Q_{Y_\gamma|X}|P) \tag{67}\\
&= D(P\|Q), \tag{68}
\end{align}

where the last equality is due to the fact that, $P$-a.s., $P_{Y_\gamma|X} = Q_{Y_\gamma|X}$ provided $P \ll Q$ (and otherwise $D(P\|Q) = \infty$). The obvious monotonicity of the right hand side of (23) in $\gamma$, put together with (68), implies that the limit in (22) exists and satisfies

$$\lim_{\gamma\to\infty} D(P_{Y_\gamma}\|Q_{Y_\gamma}) \le D(P\|Q). \tag{69}$$

For the reverse inequality one need merely note that $\gamma^{-1}\cdot Y_\gamma$ converges weakly to $X$ as $\gamma \to \infty$ under both $P$ and $Q$, and thus

\begin{align}
\lim_{\gamma\to\infty} D(P_{Y_\gamma}\|Q_{Y_\gamma}) &= \liminf_{\gamma\to\infty} D(P_{\gamma^{-1}\cdot Y_\gamma}\|Q_{\gamma^{-1}\cdot Y_\gamma}) \tag{70}\\
&\ge D(P\|Q), \tag{71}
\end{align}

where the inequality follows from the lower semi-continuity of relative entropy under weak convergence [BJ Appendix D.3].

We now note that Theorem 4.2 is a direct consequence of Theorem 4.4. Indeed, consider the setting of Theorem 4.4 when $\gamma = 1$ and $X_t \equiv X$ for all $t$ (and all $\omega$) under both measures and note that in this case $E_P[\ell(X_t, E_Q[X_t|Y^t])]$ coincides with $\mathrm{mle}_{P,Q}(t)$ from the setting of Theorem 4.2. Thus an application of Theorem 4.4 in this very special case yields Theorem 4.2 (with $T$ playing the role of $\gamma$ in the latter). It remains to prove Theorem 4.3 and Theorem 4.4, to which we dedicate the respective two subsections that follow.

7.1 Proof of Theorem 4.3 via Multivariate Point Processes

The main idea is to think of the SNR level as 'time'. In fact, we will use special notation in this subsection, 
emphasizing this point of view, where t will denote SNR and z will denote the argument for the signal (that 
elsewhere in this paper is thought of as time). The index set for z, i.e. the signal's domain, plays but a secondary 
role in the analysis, and rather than assuming that it consists of the interval [0,T], we consider it to be a general 
multidimensional Euclidean set. Our approach is to relate the observation processes Y t at different SNR t to one 
another by constructing them as transformations of a single observation process that lives in a larger space, namely 
a multivariate point process driven by the signal. The main tool is the change-of-measure formula for multivariate 
point processes ([16], [17]), which is a counterpart of the Girsanov formula for pure jump processes. This formula gives
rise to a recursion in the variable t for the RN derivative, in the form of an integral equation, a key element of the 
proof. It is here where thinking of SNR as 'time' is useful. In fact, the role played by t (SNR) is analogous to that 
played by time in the proof of the result on causal estimation provided in the next subsection. 

Let $d$ be a positive integer and consider a bounded Borel subset $E$ of $\mathbb{R}^d$. $E$ will serve as the domain for the signal process. The given measurable space $(\Omega, \mathcal{F})$ may not be rich enough to support a multivariate point process, so we switch to a new space, and define on it new signal and observation processes, that (in the setting of Subsection 4.2) share with the given signal and observation their joint distribution. Let a measurable space $(\Omega, \mathcal{F})$ be given, and a random field $X$ over $E$, taking values in $\mathbb{R}_+$, be defined on it. $X$ is assumed to be measurable as a mapping from $(\Omega \times E, \mathcal{F} \otimes \mathcal{E})$ to $(\mathbb{R}_+, \mathcal{R}_+)$, where we write $\mathcal{E}$ for the Borel $\sigma$-field $\mathcal{B}(E)$ and $\mathcal{R}_+$ for $\mathcal{B}(\mathbb{R}_+)$.^1 Also defined on $(\Omega, \mathcal{F})$ are random variables $Z_n$ and $T_n$, $n \ge 1$, with values in $E$ and $(0, \infty)$, respectively. $T_n$ are strictly increasing and finite. The sequence $(T_n, Z_n)$ forms a multivariate point process, and is characterized by the random measure on $(0, \infty) \times E$

$$\mu(\omega; dt, dz) = \sum_{n \ge 1} \delta_{(T_n(\omega), Z_n(\omega))}(dt, dz),$$

where $\delta_x$ denotes the unit point mass at $x$. Define a measure on $(E, \mathcal{E})$ by $\nu(B) = \int_B X(z)\,dz$, $B \in \mathcal{E}$. Let

$$\mathcal{G}_t = \sigma\{\mu((0, s] \times B) : s \le t, B \in \mathcal{E}\}, \qquad \mathcal{F}_t = \mathcal{F}_0 \vee \mathcal{G}_t,$$

where $\mathcal{F}_0 = \sigma\{\nu(B) : B \in \mathcal{E}\}$.

Let $\mathbf{P}_0$ (resp., $\mathbf{Q}_0$) be a probability measure on $(\Omega, \mathcal{F})$ under which $X$ and $\mu$ are mutually independent, $X$ has some given law $P$ (respectively, $Q$), and $\mu$ is a Poisson measure (cf. [17, p. 70]) relative to $\{\mathcal{F}_t\}$ with (deterministic) intensity $\nu_0$ given by

$$\nu_0(dt, dz) = dt\,dz,$$

where $dt$ and $dz$ denote Lebesgue measures on $(0, \infty)$ and $E$, respectively. This, by definition, means that $E[\mu(B)] = \nu_0(B)$ for $B \in \mathcal{R}_+ \otimes \mathcal{E}$, and that for any $t > 0$, $\mu(\cdot, B)$ is independent of the $\sigma$-field $\mathcal{F}_t$, provided $B \subset (t, \infty) \times E$. In fact, $\mu$ is the counting measure for what is often called a Poisson point process on $\mathbb{R}_+ \times E$, and consequently, each $\mu(B)$ is a Poisson r.v. with parameter $\nu_0(B)$, and $\mu(B_i)$ are independent provided $B_i$ are disjoint (see [17, p. 105]). It is assumed that, under both $\mathbf{P}_0$ and $\mathbf{Q}_0$, $X$ is a.s. bounded between two positive constants.

^1 Note that this is consistent with the assumption made in Subsection 4.2 on $X^T$; the more special structure assumed for $X^T$, namely that it has paths in $D$, is not required here, and is used only in the result on causal estimation.

We invoke an existence result, Theorem 5.2 of [16]. Let $\nu$ denote the random measure on $(E \times (0, \infty), \mathcal{E} \otimes \mathcal{B}(0, \infty))$

$$\nu(\omega; dt, dz) = \nu(\omega; dz)\,dt = X(\omega; z)\,dt\,dz.$$

Then there exists a probability measure $\mathbf{P}$ on $(\Omega, \mathcal{F})$, $\mathbf{P} \ll \mathbf{P}_0$, under which $X$ has the same law $P$ as under $\mathbf{P}_0$, while $\nu$ is a version of the $\mathcal{F}_t$-predictable projection of $\mu$ for $\mathbf{P}$. We do not define this term here (see [16]), but only note the consequence that, under $\mathbf{P}$, conditionally on $X$, $\mu(B_i)$ are Poisson r.v. with parameters $\nu(B_i)$, i.e. $\int_{B_i} X(z)\,dt\,dz$, mutually independent across $i$ provided $B_i$ are disjoint.

Furthermore, [16, Theorem 5.1] (see also [17, Theorem III.5.43]) gives an explicit version of the RN derivative

$$\Lambda_t = \Big\{\prod_{n: T_n \le t} X(Z_n)\Big\} \exp \int_{E_t} (1 - X(z))\,\nu_0(ds, dz),$$

where $E_t = (0, t] \times E$ (note that the processes a and Y of [16] vanish). This can be expressed as

$$\Lambda_t = \exp\Big\{\int_{E_t} \log X(z)\,\mu(ds, dz) + \int_{E_t} (1 - X(z))\,ds\,dz\Big\}. \tag{72}$$

Let $y_t$ denote the random measure on $E$, defined by

$$y_t(\omega, B) = \mu((0, t] \times B), \qquad B \in \mathcal{E}.$$

Let $\mathcal{Y}_t = \sigma\{y_t(B) : B \in \mathcal{E}\}$. We can use (72) to calculate the RN derivative between the laws of observation process $\mu$ (respectively, $y$) under both measures, namely $\hat{\Lambda}_t := E_{\mathbf{P}_0}[\Lambda_t|\mathcal{G}_t]$ ($\tilde{\Lambda}_t := E_{\mathbf{P}_0}[\Lambda_t|\mathcal{Y}_t]$).

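Lemma 7.3 Denote $\hat{X}_t(z) = E_{\mathbf{P}}[X(z)|\mathcal{G}_t]$ and $\tilde{X}_t(z) = E_{\mathbf{P}}[X(z)|\mathcal{Y}_t]$. Then a version of both $\hat{\Lambda}_t$ and $\tilde{\Lambda}_t$ is given by

$$\hat{\Lambda}_t = \tilde{\Lambda}_t = \exp\Big\{\int_{E_t} \log \hat{X}_{s-}(z)\,\mu(ds, dz) + \int_{E_t} (1 - \hat{X}_{s-}(z))\,ds\,dz\Big\}, \tag{73}$$

where one can write $\tilde{X}_{s-}(z)$ in place of $\hat{X}_{s-}(z)$ on the r.h.s.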



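Proof: First note by (72) that $\Lambda_t = \exp\{\int_E \log X(z)\,y_t(dz) + t\int_E (1 - X(z))\,dz\}$ depends on $\mu$ only through $y_t$. Hence $E_{\mathbf{P}_0}[\Lambda_t|\mathcal{G}_t] = E_{\mathbf{P}_0}[\Lambda_t|\mathcal{Y}_t]$, namely $\hat{\Lambda}_t = \tilde{\Lambda}_t$.

Next, note that $E_{\mathbf{P}}[X(z)|\mathcal{Y}_t] = \dfrac{E_{\mathbf{P}_0}[X(z)\Lambda_t|\mathcal{Y}_t]}{E_{\mathbf{P}_0}[\Lambda_t|\mathcal{Y}_t]}$. We have just argued that in the denominator one can replace $\mathcal{Y}_t$ by $\mathcal{G}_t$. For a similar reason, the same is true for the numerator. Hence $E_{\mathbf{P}}[X(z)|\mathcal{Y}_t] = E_{\mathbf{P}}[X(z)|\mathcal{G}_t]$, namely $\tilde{X}_t(z) = \hat{X}_t(z)$. To complete the proof it remains to show that $\hat{\Lambda}_t$ is given by the r.h.s. of (73).

Toward this end, we will argue along the lines of Result VI.R8 of [3]. Note that $\Lambda_t$ uniquely solves the integral equation

$$\Lambda_t = 1 + \int_{E_t} \Lambda_{s-}(X(z) - 1)(\mu(ds, dz) - ds\,dz). \tag{74}$$

Indeed, this can be seen as a special case of [16, equation (15)], or directly verified by noting that for $t$ between jumps $d\Lambda_t/dt = \Lambda_t \int_E (1 - X(z))\,dz$, while at times of jump $\int_{\{t\}\times E} \mu(ds, dz) = 1$ and $\Lambda_t = \Lambda_{t-} X(Z_n) = \Lambda_{t-} \int_{\{t\}\times E} X(z)\,\mu(ds, dz)$. Using the relationship between (72) and (74), with $\hat{X}_{s-}(z)$ in place of $X(z)$, it suffices to prove

$$\hat{\Lambda}_t = 1 + \int_{E_t} \hat{\Lambda}_{s-}(\hat{X}_{s-}(z) - 1)(\mu(ds, dz) - ds\,dz). \tag{75}$$

By (74) and the independence of $X$ from $\mathcal{G}_t$ under $\mathbf{P}_0$,

$$\hat{\Lambda}_t = 1 + \int_{E_t} E_{\mathbf{P}_0}[\Lambda_{s-}(X(z) - 1)|\mathcal{G}_s](\mu(ds, dz) - ds\,dz). \tag{76}$$

Thus (75) will follow once we show equality between the integrands in (75) and (76). Invoking Lemma VI.L6 of [4], showing this amounts to proving

$$E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} C_s \Lambda_{s-}(X(z) - 1)\,ds\,dz\Big] = E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} C_s \hat{\Lambda}_{s-}(\hat{X}_{s-}(z) - 1)\,ds\,dz\Big] \tag{77}$$

for any $\mathcal{G}_t$-predictable, bounded $C_t$ and $B \in \mathcal{E}$.

Toward showing (77), note first that

$$E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} \Lambda_{s-} C_s (X(z) - 1)\,ds\,dz\Big] = E_{\mathbf{P}}\Big[\int_{(0,t]\times B} C_s (X(z) - 1)\,ds\,dz\Big]. \tag{78}$$

Indeed, using integration by parts, for each $z$,

$$\int_0^t \Lambda_{s-} C_s (X(z) - 1)\,ds = \Lambda_t \int_0^t C_s (X(z) - 1)\,ds - \int_0^t \int_0^s C_u (X(z) - 1)\,du\,d\Lambda_s.$$

Since $\Lambda$ is a martingale under $\mathbf{P}_0$, so is the last term on the right, and

$$E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} \Lambda_{s-} C_s (X(z) - 1)\,ds\,dz\Big] = E_{\mathbf{P}_0}\Big[\Lambda_t \int_{(0,t]\times B} C_s (X(z) - 1)\,ds\,dz\Big].$$

This shows (78).

Next, since under $\mathbf{P}$, the random measures $X(z)\,ds\,dz$ and $\hat{X}_{s-}(z)\,ds\,dz$ are the predictable projections of $\mu$ w.r.t. $\mathcal{F}_t$ and $\mathcal{G}_t$, respectively (see [16]), we have $E_{\mathbf{P}} \int_{(0,t]\times B} C_s X(z)\,ds\,dz = E_{\mathbf{P}} \int_{(0,t]\times B} C_s \hat{X}_{s-}(z)\,ds\,dz$. Thus

\begin{align}
E_{\mathbf{P}}\Big[\int_{(0,t]\times B} C_s (X(z) - 1)\,ds\,dz\Big] &= E_{\mathbf{P}}\Big[\int_{(0,t]\times B} C_s (\hat{X}_{s-}(z) - 1)\,ds\,dz\Big] \tag{79}\\
&= E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} \Lambda_{s-} C_s (\hat{X}_{s-}(z) - 1)\,ds\,dz\Big] \tag{80}\\
&= E_{\mathbf{P}_0}\Big[\int_{(0,t]\times B} \hat{\Lambda}_{s-} C_s (\hat{X}_{s-}(z) - 1)\,ds\,dz\Big], \tag{81}
\end{align}

where the second equality follows by an argument similar to that leading to (78). Displays (81) and (78) imply (77), which completes the proof. □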

We use a convention, analogous to that used elsewhere in this paper, of writing $E_P$ for expectation w.r.t. the joint law of $X$ (signal) and $y_t$ (observation at SNR level $t$) under $\mathbf{P}$. This is legitimate here as well since this joint law is determined by $P$ and the conditional law of $y_t$ as the counting measure of a Poisson point process on $E$ with intensity $tX$. This is valid also for conditioning, where we will write $\tilde{X}^P_t(z) := E_{\mathbf{P}}[X(z)|\mathcal{Y}_t]$ as $E_P[X(z)|y_t]$, and for the law of $y_t$, written $P_{y_t}$. With this notation, applying Lemma 7.3 to both $P$ and $Q$ and noting that the law of $y_t$ under $\mathbf{P}_0$ does not depend on $P$ (hence equal to that under $\mathbf{Q}_0$), a version of $\frac{dP_{y_t}}{dQ_{y_t}}$ is given by

$$\exp\Big\{\int_{E_t} [\log \tilde{X}^P_{s-}(z) - \log \tilde{X}^Q_{s-}(z)]\,\mu(ds, dz) - \int_{E_t} [\tilde{X}^P_{s-}(z) - \tilde{X}^Q_{s-}(z)]\,ds\,dz\Big\}.$$

Hence

$$D(P_{y_t}\|Q_{y_t}) = E_P\Big[\int_{E_t} \log\frac{\tilde{X}^P_{s-}(z)}{\tilde{X}^Q_{s-}(z)}\,\mu(ds, dz) - \int_{E_t} (\tilde{X}^P_{s-}(z) - \tilde{X}^Q_{s-}(z))\,ds\,dz\Big]. \tag{82}$$



Note by [16] that an integral of the form $\int_{E_t} R(s, z)(\mu(ds, dz) - \nu(ds, dz))$ forms a martingale under $\mathbf{P}$, provided $R$ is a predictable integrable process. The predictability of $\log \tilde{X}^P_{s-}(z)/\tilde{X}^Q_{s-}(z)$ follows from left continuity, while integrability follows from the boundedness assumptions we put on $X$. Recalling that $\nu(dt, dz)$ is given by $X(z)\,dt\,dz$, we have

\begin{align}
D(P_{y_t}\|Q_{y_t}) &= E_P\Big[\int_{E_t} \Big(X(z)\log\frac{\tilde{X}^P_{s-}(z)}{\tilde{X}^Q_{s-}(z)} - \tilde{X}^P_{s-}(z) + \tilde{X}^Q_{s-}(z)\Big)\,ds\,dz\Big] \tag{83}\\
&= E_P\Big[\int_{E_t} \Big(E_P[X(z)|y_s]\log\frac{E_P[X(z)|y_s]}{E_Q[X(z)|y_s]} - E_P[X(z)|y_s] + E_Q[X(z)|y_s]\Big)\,ds\,dz\Big]. \tag{84}
\end{align}

Changing the order of integration, we have thus established the following.

Theorem 7.6 For $t > 0$,

$$D(P_{y_t}\|Q_{y_t}) = \int_0^t \int_E E_P\left[\ell\left(E_P[X(z)|y_s], E_Q[X(z)|y_s]\right)\right] dz\,ds.$$

Proof of Theorem 4.3: Specializing to $E = [0, T]$ and setting $Y_\gamma(z) = y_t([0, z])$, $z \in [0, T]$, where $\gamma = t$, gives for $(X, Y)$ precisely the law indicated in Subsection 4.2 for the signal-observation pair. The result thus follows from Theorem 7.6 as a special case. □



7.2 Proof of Theorem 4.4

The tools here parallel those of the previous subsection, but are simpler in that change-of-measure considerations are used only for usual point processes. Naturally, $t$ here will denote time. The treatment cannot be regarded as a special case of the one for multivariate point processes, because the signal varies with time (whereas there was no dependence of the signal on SNR). A related (but distinct) calculation is performed in [20, Section 19.5] for mutual information. Here we use a result from Section VI.5 of [4], playing a role similar to that of Lemma 7.3 from the previous subsection.

Let $P, Q \in \mathcal{P}$ be given. Letting $P_\gamma$ (resp. $Q_\gamma$) stand for the distribution of $\gamma \cdot X^T$ when $X^T \sim P$ (resp. $X^T \sim Q$), by the third property in Lemma 2.1 we have

$$\mathrm{cmle}_{P_\gamma, Q_\gamma}(1) = \gamma \cdot \mathrm{cmle}_{P,Q}(\gamma), \tag{85}$$

implying it suffices to prove Theorem 4.4 assuming $\gamma = 1$. Thus, assuming $X^T = \{X_t, t \in [0, T]\}$ is a non-negative process and that, conditioned on $X^T$, $Y^T$ is a Poisson process of intensity $X^T$, it suffices to prove

$$D(P_{Y^T}\|Q_{Y^T}) = \int_0^T E_P\left[\ell(X_s, E_Q[X_s|Y^s]) - \ell(X_s, E_P[X_s|Y^s])\right] ds \tag{86}$$

which, in view of Lemma 2.2, is equivalent to

$$D(P_{Y^T}\|Q_{Y^T}) = \int_0^T E_P\left[\ell(E_P[X_s|Y^s], E_Q[X_s|Y^s])\right] ds. \tag{87}$$

Toward proving the above identity, let $\mathbf{P}, \mathbf{Q}$ be members of $\mathcal{P}_1$ corresponding to $P$ and $Q$, respectively. Also consider an auxiliary probability measure $\mathbf{P}_0$ on $(\Omega, \mathcal{F})$ under which $X^T$ and $Y^T$ are mutually independent, $X^T$ is distributed $P$, while $Y^T$ is a standard Poisson process. Denote

$$\Lambda_t = \exp\Big\{\int_{(0,t]} \log X_s\,dY_s + \int_0^t (1 - X_s)\,ds\Big\}. \tag{88}$$

Then $\Lambda$ uniquely solves the equation

$$\Lambda_t = 1 + \int_{(0,t]} \Lambda_{s-}(X_s - 1)\,dY_s + \int_0^t \Lambda_s (1 - X_s)\,ds, \tag{89}$$

and, under $\mathbf{P}_0$, it is a nonnegative martingale with expected value 1. As a result, $\frac{d\tilde{\mathbf{P}}}{d\mathbf{P}_0} = \Lambda_T$ defines a probability measure $\tilde{\mathbf{P}}$ on $(\Omega, \mathcal{F})$. By the analogue of Girsanov's theorem for point processes [20, Theorem 19.4] (cf. also Theorem 19.10 therein; or use the result from the previous subsection for the case where $E$ is a singleton), the joint law of $(X^T, Y^T)$ under $\tilde{\mathbf{P}}$ agrees with that under $\mathbf{P}$. Consequently, $\Lambda_t$ is a version of the RN derivative between this joint law under $\mathbf{P}$ and under $\mathbf{P}_0$, and denoting by $R_{Y^t}$ the law of $Y^t$ under $\mathbf{P}_0$, writing $E_0$ for $E_{\mathbf{P}_0}$, we have

$$\frac{dP_{Y^t}}{dR_{Y^t}} = \bar{\Lambda}_t := E_0[\Lambda_t|Y^t], \qquad t \in [0, T]. \tag{90}$$

Result VI(5.6) of [4] (alternatively, an argument along the lines of Lemma 7.3 based on (88), (89)) states

$$\bar{\Lambda}_t = \exp\Big\{\int_{(0,t]} \log \hat{X}^P_{s-}\,dY_s + \int_0^t (1 - \hat{X}^P_{s-})\,ds\Big\}, \tag{91}$$

where $\hat{X}^P_t = E_{\mathbf{P}}[X_t|Y^t]$. Noting that $R_{Y^t}$ is nothing but a standard Poisson process (in particular, not depending on $P$),

$$\frac{dQ_{Y^t}}{dR_{Y^t}} = \exp\Big\{\int_{(0,t]} \log \hat{X}^Q_{s-}\,dY_s + \int_0^t (1 - \hat{X}^Q_{s-})\,ds\Big\} \tag{92}$$

whence

$$D(P_{Y^t}\|Q_{Y^t}) = E_P\Big[\int_{(0,t]} \log\frac{\hat{X}^P_{s-}}{\hat{X}^Q_{s-}}\,dY_s - \int_0^t (\hat{X}^P_{s-} - \hat{X}^Q_{s-})\,ds\Big]. \tag{93}$$

Under $\mathbf{P}$, $Y_t - \int_0^t X_s\,ds$ is a martingale, hence so is $\int_0^t Z_s (dY_s - X_s\,ds)$ for any predictable integrable process $Z$. Thus

\begin{align}
D(P_{Y^t}\|Q_{Y^t}) &= E_P\Big[\int_0^t \Big(X_s \log\frac{\hat{X}^P_{s-}}{\hat{X}^Q_{s-}} - \hat{X}^P_{s-} + \hat{X}^Q_{s-}\Big)\,ds\Big] \tag{94}\\
&= E_P\Big[\int_0^t \Big(E_P[X_s|Y^s]\log\frac{E_P[X_s|Y^s]}{E_Q[X_s|Y^s]} - E_P[X_s|Y^s] + E_Q[X_s|Y^s]\Big)\,ds\Big] \tag{95}\\
&= \int_0^t E_P\left[\ell(E_P[X_s|Y^s], E_Q[X_s|Y^s])\right] ds, \tag{96}
\end{align}

where the equality in (95) follows by conditioning on $Y^s$. We have thus established (87) and completed the proof.



7.3 The Presence of Feedback 



An inspection of our proof of Theorem 4.4 reveals that it carries over verbatim to accommodate the presence of 
feedback. Specifically, the relationship 



$$D(P_{Y^T}\|Q_{Y^T}) = \int_0^T E_P\left[\ell(X_t, E_Q[X_t|Y^t]) - \ell(X_t, E_P[X_t|Y^t])\right] dt \tag{97}$$



continues to hold in the case where, under both P and Q, the output process {Y t }t>o is a point process which admits 
the predictable intensity X t , adapted to a suitable filtration. Indeed, the analogue of Girsanov's theorem for point 
processes that we have employed in the proof accommodates this level of generality. As for the case of non-causal 
estimation, it is easy to find examples involving the presence of feedback where Equality (25) no longer holds. 



8 An Alternative Proof Route 

We dedicate this section to an outline of an alternative route for proving the main results of this paper, namely Theorems 4.2, 4.3 and 4.4. In this alternative route, rather than starting with the continuous-time setting and then obtaining Theorem 4.2 as a consequence of Theorem 4.4, our starting point is a proof of Theorem 4.2 that is based on first principles. We then extend it to random n-vectors, an extension that follows rather directly from the scalar case. Theorem 4.3 is then proven by 'lifting' from finite dimensional vectors to continuous-time processes, establishing first that, when the underlying noise-free signal is piecewise constant, the relative entropy between the true and mismatched distributions of the channel output process coincides with that between the true and mismatched laws of the counts at the edge points of the constancy intervals. This fact allows us to appeal to the vector case result to establish Theorem 4.3 for said piecewise constant processes, the general case then following by approximation and limiting arguments. Theorem 4.4, on the other hand, can also be proven using Theorem 4.2 as the main building block and an appeal to the chain rule for relative entropy. We omit the details and refer to [33, Section 4-C] for a totally analogous approach taken therein for the Gaussian case.

The merit in this alternative route is in providing further intuition and insight into why the main results hold. We give an elementary proof of Theorem 4.2, and the subsequent results are seen to stem from Theorem 4.2 in a natural way. Because we have already provided rigorous proofs of the main results in the previous section, our goal here is only to outline the main ideas of this alternative proof route, and thus throughout this section we tacitly change orders of limits, summations, integrations, differentiations, etc.



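8.1 Outline of an Elementary Proof of Theorem 4.2

Noting that

\begin{align}
D(P_{Y_\gamma}\|Q_{Y_\gamma}) &= \sum_{y=0}^{\infty} P_{Y_\gamma}(y) \log\frac{P_{Y_\gamma}(y)}{Q_{Y_\gamma}(y)} \tag{98}\\
&= \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \log\frac{\int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x)}{\int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dQ(x)} \tag{99}\\
&= \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)}, \tag{100}
\end{align}

we have

\begin{align}
\frac{d}{d\gamma} D(P_{Y_\gamma}\|Q_{Y_\gamma}) &= \sum_{y=0}^{\infty} \frac{d}{d\gamma}\left\{\int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)}\right\} \tag{101}\\
&= \sum_{y=0}^{\infty} \left[\frac{d}{d\gamma}\int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x)\right] \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} + \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \frac{d}{d\gamma}\log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} \tag{102}\\
&= \sum_{y=0}^{\infty} \int \frac{-x e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} + \sum_{y=1}^{\infty} \int \frac{x e^{-\gamma x}(\gamma x)^{y-1}}{(y-1)!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} \tag{103}\\
&\quad + \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \frac{d}{d\gamma}\log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)}. \tag{104}
\end{align}

Considering the second sum in (103),

\begin{align}
\sum_{y=1}^{\infty} \int \frac{x e^{-\gamma x}(\gamma x)^{y-1}}{(y-1)!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} \tag{105}\\
= \sum_{y=0}^{\infty} \int \frac{x e^{-\gamma x}(\gamma x)^{y}}{y!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^{y+1}\,dP(x)}{\int e^{-\gamma x}x^{y+1}\,dQ(x)}, \tag{106}
\end{align}

and therefore the overall expression in (103) assumes the form

\begin{align}
&\sum_{y=0}^{\infty} \int \frac{x e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \log\frac{\int e^{-\gamma x}x^{y+1}\,dP(x) \int e^{-\gamma x}x^{y}\,dQ(x)}{\int e^{-\gamma x}x^{y+1}\,dQ(x) \int e^{-\gamma x}x^{y}\,dP(x)} \tag{107}\\
&= \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \frac{\int x e^{-\gamma x}(\gamma x)^y\,dP(x)}{\int e^{-\gamma x}(\gamma x)^y\,dP(x)} \cdot \log\frac{\int e^{-\gamma x}x^{y+1}\,dP(x) \int e^{-\gamma x}x^{y}\,dQ(x)}{\int e^{-\gamma x}x^{y}\,dP(x) \int e^{-\gamma x}x^{y+1}\,dQ(x)} \tag{108}\\
&\stackrel{(a)}{=} \sum_{y=0}^{\infty} P(Y_\gamma = y)\,E_P[X|Y_\gamma = y] \log\frac{E_P[X|Y_\gamma = y]}{E_Q[X|Y_\gamma = y]} \tag{109}\\
&= E_P\left[E_P[X|Y_\gamma] \log\frac{E_P[X|Y_\gamma]}{E_Q[X|Y_\gamma]}\right], \tag{110}
\end{align}

where (a) follows upon noting

$$\frac{\int e^{-\gamma x}x^{y+1}\,dP(x)}{\int e^{-\gamma x}x^{y}\,dP(x)} = \frac{\int x \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x)}{\int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x)} = E_P[X|Y_\gamma = y]. \tag{111}$$

Turning to the expression in (104),

\begin{align}
&\sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \frac{d}{d\gamma}\log\frac{\int e^{-\gamma x}x^y\,dP(x)}{\int e^{-\gamma x}x^y\,dQ(x)} \tag{112}\\
&= \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \frac{\int e^{-\gamma x}x^y\,dQ(x)}{\int e^{-\gamma x}x^y\,dP(x)} \cdot \frac{\int -x e^{-\gamma x}x^y\,dP(x) \int e^{-\gamma x}x^y\,dQ(x) - \int e^{-\gamma x}x^y\,dP(x) \int -x e^{-\gamma x}x^y\,dQ(x)}{\left(\int e^{-\gamma x}x^y\,dQ(x)\right)^2} \tag{113}\\
&= \sum_{y=0}^{\infty} \int \frac{e^{-\gamma x}(\gamma x)^y}{y!}\,dP(x) \cdot \left[\frac{\int e^{-\gamma x}x^{y+1}\,dQ(x)}{\int e^{-\gamma x}x^{y}\,dQ(x)} - \frac{\int e^{-\gamma x}x^{y+1}\,dP(x)}{\int e^{-\gamma x}x^{y}\,dP(x)}\right] \tag{114}\\
&= \sum_{y=0}^{\infty} P(Y_\gamma = y) \cdot \left[E_Q[X|Y_\gamma = y] - E_P[X|Y_\gamma = y]\right] = E_P\left[E_Q[X|Y_\gamma] - E_P[X|Y_\gamma]\right]. \tag{115}
\end{align}

Thus

\begin{align}
\frac{d}{d\gamma} D(P_{Y_\gamma}\|Q_{Y_\gamma}) &\stackrel{(a)}{=} E_P\left[E_P[X|Y_\gamma] \log\frac{E_P[X|Y_\gamma]}{E_Q[X|Y_\gamma]} + E_Q[X|Y_\gamma] - E_P[X|Y_\gamma]\right] \tag{116}\\
&= E_P\left[\ell(E_P[X|Y_\gamma], E_Q[X|Y_\gamma])\right] \tag{117}\\
&\stackrel{(b)}{=} E_P\left[\ell(X, E_Q[X|Y_\gamma]) - \ell(X, E_P[X|Y_\gamma])\right] \tag{118}\\
&= \mathrm{mle}_{P,Q}(\gamma) - \mathrm{mle}_{P,P}(\gamma), \tag{119}
\end{align}

where (a) follows from combining (103), (104), (110) and (115), and (b) from the (conditional version of) Lemma 2.2.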
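As a numerical illustration of the identity just derived, the following sketch compares a central finite difference of $D(P_{Y_\gamma}\|Q_{Y_\gamma})$ with $\mathrm{mle}_{P,Q}(\gamma) - \mathrm{mle}_{P,P}(\gamma)$ evaluated directly from its definition. It is not part of the original argument: the two-point priors, the SNR value, the step size and the truncation of the Poisson alphabet are all arbitrary choices made for the example.

```python
import math

def loss(x, xhat):
    """The loss l(x, xhat) = x log(x / xhat) - x + xhat, for x, xhat > 0."""
    return x * math.log(x / xhat) - x + xhat

def pois(y, lam):
    """Poisson(lam) probability of y, computed in the log domain."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def marginal(prior, gam, y):
    """P(Y_gamma = y) when X is drawn from `prior` (a dict: value -> weight)."""
    return sum(w * pois(y, gam * x) for x, w in prior.items())

def cond_mean(prior, gam, y):
    """Conditional expectation of X given Y_gamma = y under `prior`."""
    return sum(w * x * pois(y, gam * x) for x, w in prior.items()) / marginal(prior, gam, y)

def divergence(P, Q, gam, n_max=60):
    """D(P_Y || Q_Y), with the output alphabet truncated at n_max (illustrative choice)."""
    return sum(marginal(P, gam, y) * math.log(marginal(P, gam, y) / marginal(Q, gam, y))
               for y in range(n_max))

def excess_loss(P, Q, gam, n_max=60):
    """mle_{P,Q}(gam) - mle_{P,P}(gam), computed directly from the definition."""
    total = 0.0
    for x, w in P.items():
        for y in range(n_max):
            total += w * pois(y, gam * x) * (
                loss(x, cond_mean(Q, gam, y)) - loss(x, cond_mean(P, gam, y)))
    return total

# Hypothetical two-point priors sharing the same support.
P = {0.5: 0.3, 2.0: 0.7}
Q = {0.5: 0.6, 2.0: 0.4}
gam, h = 1.5, 1e-4
derivative = (divergence(P, Q, gam + h) - divergence(P, Q, gam - h)) / (2 * h)
excess = excess_loss(P, Q, gam)
print(derivative, excess)
```

The two printed values should agree up to the finite-difference and truncation errors.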



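8.2 Extension to Random Vectors

Let $X^n = (X_1, \ldots, X_n)$ be a random n-tuple with non-negative components and, for $\gamma > 0$, let $Y_\gamma^n = (Y_{\gamma,1}, \ldots, Y_{\gamma,n})$ be jointly distributed with $X^n$ such that, given $X^n$, the components of $Y_\gamma^n$ are independent with $Y_{\gamma,i}|X^n \sim \mathrm{Poisson}(\gamma X_i)$, $1 \le i \le n$. If $X^n \sim P$ we denote the distribution of $Y_\gamma^n$ by $P_{Y_\gamma^n}$, conditional expectations by $E_P[X_i|Y_\gamma^n]$, etc. We extend the notation in (20) to this case by

$$\mathrm{mle}_{P,Q}(\gamma) \triangleq E_P\left[\sum_{i=1}^{n} \ell(X_i, E_Q[X_i|Y_\gamma^n])\right]. \tag{120}$$

With this extension of the notation, Theorem 4.2 carries over to random vectors essentially verbatim:

Theorem 8.7 For any pair of probability laws $P, Q$ governing $X^n$ with components bounded between two positive constants, and for any $\gamma > 0$,

$$D(P_{Y_\gamma^n}\|Q_{Y_\gamma^n}) = \int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha. \tag{121}$$

Outline of proof:^2 For $\beta = (\beta_1, \ldots, \beta_n) \in [0, \infty)^n$, let $Y_\beta^n$ be jointly distributed with $X^n$ such that, given $X^n$, the components of $Y_\beta^n$ are independent and $Y_{\beta,i}|X^n \sim \mathrm{Poisson}(\beta_i X_i)$, $1 \le i \le n$. Then

\begin{align}
\frac{d}{d\gamma} D(P_{Y_\gamma^n}\|Q_{Y_\gamma^n}) &= \sum_{i=1}^{n} \frac{\partial}{\partial\beta_i} D(P_{Y_\beta^n}\|Q_{Y_\beta^n})\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{122}\\
&\stackrel{(a)}{=} \sum_{i=1}^{n} \frac{\partial}{\partial\beta_i}\left[D(P_{Y_\beta^{n\setminus i}}\|Q_{Y_\beta^{n\setminus i}}) + D(P_{Y_{\beta,i}|Y_\beta^{n\setminus i}}\|Q_{Y_{\beta,i}|Y_\beta^{n\setminus i}}|P_{Y_\beta^{n\setminus i}})\right]\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{123}\\
&\stackrel{(b)}{=} \sum_{i=1}^{n} \frac{\partial}{\partial\beta_i} D(P_{Y_{\beta,i}|Y_\beta^{n\setminus i}}\|Q_{Y_{\beta,i}|Y_\beta^{n\setminus i}}|P_{Y_\beta^{n\setminus i}})\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{124}\\
&= \sum_{i=1}^{n} \frac{\partial}{\partial\beta_i} \int D(P_{Y_{\beta,i}|Y_\beta^{n\setminus i}=y^{n\setminus i}}\|Q_{Y_{\beta,i}|Y_\beta^{n\setminus i}=y^{n\setminus i}})\,dP_{Y_\beta^{n\setminus i}}(y^{n\setminus i})\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{125}\\
&= \sum_{i=1}^{n} \int \frac{\partial}{\partial\beta_i} D(P_{Y_{\beta,i}|Y_\beta^{n\setminus i}=y^{n\setminus i}}\|Q_{Y_{\beta,i}|Y_\beta^{n\setminus i}=y^{n\setminus i}})\,dP_{Y_\beta^{n\setminus i}}(y^{n\setminus i})\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{126}\\
&\stackrel{(c)}{=} \sum_{i=1}^{n} \int E_P\left[\ell(X_i, E_Q[X_i|Y_\beta^n]) - \ell(X_i, E_P[X_i|Y_\beta^n]) \,\Big|\, Y_\beta^{n\setminus i} = y^{n\setminus i}\right] dP_{Y_\beta^{n\setminus i}}(y^{n\setminus i})\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{127}\\
&= \sum_{i=1}^{n} E_P\left[\ell(X_i, E_Q[X_i|Y_\beta^n]) - \ell(X_i, E_P[X_i|Y_\beta^n])\right]\Big|_{\beta=(\gamma,\ldots,\gamma)} \tag{128}\\
&= \sum_{i=1}^{n} E_P\left[\ell(X_i, E_Q[X_i|Y_\gamma^n]) - \ell(X_i, E_P[X_i|Y_\gamma^n])\right] \tag{129}\\
&= \mathrm{mle}_{P,Q}(\gamma) - \mathrm{mle}_{P,P}(\gamma), \tag{130}
\end{align}

where $Y_\beta^{n\setminus i}$ denotes the $(n-1)$-tuple $(Y_{\beta,1}, \ldots, Y_{\beta,i-1}, Y_{\beta,i+1}, \ldots, Y_{\beta,n})$, (a) is an application of the chain rule of relative entropy (16), (b) is due to the fact that the distributions $P_{Y_\beta^{n\setminus i}}$ and $Q_{Y_\beta^{n\setminus i}}$ do not depend on $\beta_i$, and (c) is an application of Theorem 4.2 with the roles of $P$ and $Q$ played respectively by $P_{X_i|Y_\beta^{n\setminus i}}$ and $Q_{X_i|Y_\beta^{n\setminus i}}$.

□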
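The vector statement can be probed numerically in the same spirit as the scalar one. The sketch below is an illustration under assumed inputs, not the paper's procedure: it takes $n = 2$, two hypothetical priors on a four-point support with dependent components under $P$, and checks by central finite differences that the derivative of $D(P_{Y_\gamma^n}\|Q_{Y_\gamma^n})$ in $\gamma$ matches $\mathrm{mle}_{P,Q}(\gamma) - \mathrm{mle}_{P,P}(\gamma)$ in the sense of (120). The truncation grid and step size are arbitrary.

```python
import math

def loss(x, xhat):
    """The loss l(x, xhat) = x log(x / xhat) - x + xhat, for x, xhat > 0."""
    return x * math.log(x / xhat) - x + xhat

def pois(y, lam):
    """Poisson(lam) probability of y, computed in the log domain."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def likelihood(x, gam, y):
    """Probability of the count vector y given X^n = x: independent Poisson(gam * x_i)."""
    return math.prod(pois(yi, gam * xi) for xi, yi in zip(x, y))

def output_pmf(prior, gam, y):
    """Probability of the count vector y when X^n ~ prior (a dict: tuple -> weight)."""
    return sum(w * likelihood(x, gam, y) for x, w in prior.items())

def cond_mean(prior, gam, y, i):
    """Conditional expectation of X_i given the whole count vector y."""
    num = sum(w * x[i] * likelihood(x, gam, y) for x, w in prior.items())
    return num / output_pmf(prior, gam, y)

def divergence(P, Q, gam, grid):
    total = 0.0
    for y in grid:
        py = output_pmf(P, gam, y)
        total += py * math.log(py / output_pmf(Q, gam, y))
    return total

def excess_loss(P, Q, gam, grid):
    """Vector mle_{P,Q}(gam) - mle_{P,P}(gam): per-component losses summed over i."""
    total = 0.0
    for y in grid:
        for i in range(len(y)):
            est_q, est_p = cond_mean(Q, gam, y, i), cond_mean(P, gam, y, i)
            for x, w in P.items():
                total += w * likelihood(x, gam, y) * (loss(x[i], est_q) - loss(x[i], est_p))
    return total

# Hypothetical priors on pairs (X_1, X_2); components are dependent under P.
P = {(0.5, 1.0): 0.2, (0.5, 2.0): 0.3, (2.0, 1.0): 0.4, (2.0, 2.0): 0.1}
Q = {(0.5, 1.0): 0.25, (0.5, 2.0): 0.25, (2.0, 1.0): 0.25, (2.0, 2.0): 0.25}
grid = [(a, b) for a in range(40) for b in range(40)]  # truncated output alphabet
gam, h = 1.0, 1e-4
derivative = (divergence(P, Q, gam + h, grid) - divergence(P, Q, gam - h, grid)) / (2 * h)
excess = excess_loss(P, Q, gam, grid)
print(derivative, excess)
```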



8.3 Outline of Alternative Proof of Theorem 4.3

We start with:

Lemma 8.4 Let $X$ be a non-negative random variable, jointly distributed with the pair $(Y_1, Y_2)$ such that, conditionally on $X$, $Y_1$ and $Y_2$ are mutually independent, with both $Y_1|X \sim \mathrm{Poisson}(X)$ and $Y_2|X \sim \mathrm{Poisson}(X)$. Letting $P$ and $Q$ correspond to two distributions on $X$:

$$D(P_{Y_1,Y_2}\|Q_{Y_1,Y_2}) = D(P_{Y_1+Y_2}\|Q_{Y_1+Y_2}). \tag{131}$$

^2 Theorem 8.7 follows immediately from Theorem 4.3 by specializing the latter to piecewise constant processes. In the outlined proof given here, however, the idea is of course not to rely on Theorem 4.3 and assume only Theorem 4.2, whose elementary proof was outlined in the previous subsection.

Proof of Lemma 8.4: Denoting $Z = Y_1 + Y_2$, for any $x > 0$, integer $z \ge 0$, and integer $0 \le y_1 \le z$

\begin{align}
P_{Y_1|Z,X}(y_1|z,x) &= \frac{P_{Y_1,Z|X}(y_1,z|x)}{P_{Z|X}(z|x)} \tag{132}\\
&= \frac{\frac{e^{-x}x^{y_1}}{y_1!}\cdot\frac{e^{-x}x^{z-y_1}}{(z-y_1)!}}{\frac{e^{-2x}(2x)^z}{z!}} \tag{133}\\
&= \frac{z!}{y_1!\,(z-y_1)!\,2^z} \tag{134}\\
&= \binom{z}{y_1}\left(\frac{1}{2}\right)^z, \tag{135}
\end{align}

which does not depend on the value of $x$ and therefore

$$P_{Y_1|Z}(y_1|z) = \binom{z}{y_1}\left(\frac{1}{2}\right)^z. \tag{136}$$

In particular, the induced conditional distribution of $Y_1$ given $Z$ does not depend on the distribution of $X$ so

$$P_{Y_1|Z} = Q_{Y_1|Z}. \tag{137}$$

Thus

\begin{align}
D(P_{Y_1,Y_2}\|Q_{Y_1,Y_2}) &\stackrel{(a)}{=} D(P_{Y_1,Z}\|Q_{Y_1,Z}) \tag{138}\\
&= D(P_Z\|Q_Z) + D(P_{Y_1|Z}\|Q_{Y_1|Z}|P_Z) \tag{139}\\
&\stackrel{(b)}{=} D(P_Z\|Q_Z), \tag{140}
\end{align}

where (a) is due to the fact that there is a one-to-one transformation from $(Y_1, Y_2)$ to $(Y_1, Z)$ and (b) is due to (137).

□
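Lemma 8.4 is easy to check numerically for discrete priors. The sketch below, an illustration with hypothetical two-point priors and an arbitrary truncation of the Poisson alphabet, computes the relative entropy between the joint laws of $(Y_1, Y_2)$ and between the laws of $Z = Y_1 + Y_2$, using the fact that $Z|X \sim \mathrm{Poisson}(2X)$.

```python
import math

def pois(y, lam):
    """Poisson(lam) probability of y, computed in the log domain."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def kl(p_vals, q_vals):
    """Relative entropy between two pmfs listed over a common (truncated) alphabet."""
    return sum(a * math.log(a / b) for a, b in zip(p_vals, q_vals) if a > 0.0)

def pair_pmf(prior, n_max):
    """Joint pmf of (Y1, Y2): conditionally i.i.d. Poisson(X) given X ~ prior."""
    return [sum(w * pois(y1, x) * pois(y2, x) for x, w in prior.items())
            for y1 in range(n_max) for y2 in range(n_max)]

def sum_pmf(prior, n_max):
    """pmf of Z = Y1 + Y2, which is Poisson(2X) given X ~ prior."""
    return [sum(w * pois(z, 2.0 * x) for x, w in prior.items())
            for z in range(2 * n_max)]

# Hypothetical two-point priors on X (dict: value -> weight).
P = {0.5: 0.3, 2.0: 0.7}
Q = {0.5: 0.6, 2.0: 0.4}
n_max = 60  # truncation level; the neglected tail mass is negligible here
d_pair = kl(pair_pmf(P, n_max), pair_pmf(Q, n_max))
d_sum = kl(sum_pmf(P, n_max), sum_pmf(Q, n_max))
print(d_pair, d_sum)
```

The two values should coincide up to floating-point and truncation error, reflecting that the split of $Z$ into $(Y_1, Y_2)$ carries no information about the prior.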



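Iterating Lemma 8.4 by halving intervals gives

Corollary 8.4 Let $X$ be a non-negative random variable, and let $Y^T|X$ be a homogeneous Poisson process of intensity $X$. Letting $P$ and $Q$ correspond to two distributions on $X$,

$$D(P_{Y^T}\|Q_{Y^T}) = D(P_{Y_T}\|Q_{Y_T}). \tag{141}$$

Equipped with Corollary 8.4 we can start to prove Theorem 4.3 by establishing (25) under the assumption that $X^T$ is piecewise constant both under $P$ and under $Q$, i.e., assume existence of $n$ and a random vector of nonnegative components $A^n = (A_1, \ldots, A_n)$ such that

$$X_t = A_i \quad \text{for all } \frac{i-1}{n}T \le t < \frac{i}{n}T, \text{ and } 1 \le i \le n. \tag{142}$$

We use $P$ and $Q$ to denote either the measures governing the continuous-time piecewise-constant signal $X^T$ satisfying (142), or those governing the n-dimensional random vector $A^n$. To make the distinction clear, we add the superscript vec in the latter case, writing $\mathrm{mle}^{vec}_{P,Q}(\gamma)$ for $\mathrm{mle}_{P,Q}(\gamma)$ of Section 8.2 (recall (120)) while staying with $\mathrm{mle}_{P,Q}(\gamma)$ for the continuous-time analogue in (24). In the same vein, we let $Y_\gamma^{vec,n} = (Y^{vec}_{\gamma,1}, \ldots, Y^{vec}_{\gamma,n})$ denote a random n-tuple jointly distributed with $A^n$ such that, given $A^n$, the components of $Y_\gamma^{vec,n}$ are independent with $Y^{vec}_{\gamma,i}|A^n \sim \mathrm{Poisson}(\gamma A_i)$, while staying with $Y_\gamma^T = \{Y_{\gamma,t}, 0 \le t \le T\}$ to denote the process which, conditioned on $X^T$, is a Poisson process of intensity $\gamma X^T$. With this convention, we have

$$\mathrm{mle}_{P,Q}(\gamma) = \frac{T}{n}\,\mathrm{mle}^{vec}_{P,Q}\left(\frac{\gamma T}{n}\right) \tag{143}$$

and thus

\begin{align}
\int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha &\stackrel{(a)}{=} \frac{T}{n}\int_0^\gamma \left[\mathrm{mle}^{vec}_{P,Q}\left(\frac{\alpha T}{n}\right) - \mathrm{mle}^{vec}_{P,P}\left(\frac{\alpha T}{n}\right)\right] d\alpha \tag{144}\\
&= \int_0^{\gamma T/n} \left[\mathrm{mle}^{vec}_{P,Q}(\alpha) - \mathrm{mle}^{vec}_{P,P}(\alpha)\right] d\alpha \tag{145}\\
&\stackrel{(b)}{=} D\left(P_{Y^{vec,n}_{\gamma T/n}} \,\Big\|\, Q_{Y^{vec,n}_{\gamma T/n}}\right) \tag{146}\\
&\stackrel{(c)}{=} D\left(P_{\{Y_{\gamma,iT/n} - Y_{\gamma,(i-1)T/n}\}_{i=1}^n} \,\Big\|\, Q_{\{Y_{\gamma,iT/n} - Y_{\gamma,(i-1)T/n}\}_{i=1}^n}\right) \tag{147}\\
&\stackrel{(d)}{=} D\left(P_{\{Y_{\gamma,iT/n}\}_{i=1}^n} \,\Big\|\, Q_{\{Y_{\gamma,iT/n}\}_{i=1}^n}\right) \tag{148}\\
&\stackrel{(e)}{=} D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}), \tag{149}
\end{align}

where (a) follows from (143), (b) follows from Theorem 8.7, (c) is due to the fact that $Y^{vec,n}_{\gamma T/n}$ is equal in distribution to $\{Y_{\gamma,iT/n} - Y_{\gamma,(i-1)T/n}\}_{i=1}^n$ regardless of the distribution of the underlying $A^n$, (d) is due to the fact that there is a one-to-one transformation from $\{Y_{\gamma,iT/n} - Y_{\gamma,(i-1)T/n}\}_{i=1}^n$ to $\{Y_{\gamma,iT/n}\}_{i=1}^n$, and (e) is due to an application of Corollary 8.4 on each of the constancy intervals of $X^T$ (and invoking the chain rule of relative entropy).

This concludes the proof for the case where the input is a piecewise constant process of the form in (142). For general processes, given the two process distributions $P$ and $Q$, one considers the induced measures $P^{(n)}$ and $Q^{(n)}$ on the input process $X^{(n),T}$ obtained from the original one via

$$X_t^{(n)} = \frac{2^n}{T}\int_{i2^{-n}T}^{(i+1)2^{-n}T} X_s\,ds \quad \text{for } t \in \left(i2^{-n}T, (i+1)2^{-n}T\right], \tag{150}$$

a piecewise constant process of the form in (142) for which we have already established the result. Thus, for any $n$,

$$D\left(P^{(n)}_{Y_\gamma^T} \,\Big\|\, Q^{(n)}_{Y_\gamma^T}\right) = \int_0^\gamma \left[\mathrm{mle}_{P^{(n)},Q^{(n)}}(\alpha) - \mathrm{mle}_{P^{(n)},P^{(n)}}(\alpha)\right] d\alpha. \tag{151}$$

Standard continuity arguments similar to those in [33, Section IV.C] would now yield

$$D\left(P^{(n)}_{Y_\gamma^T} \,\Big\|\, Q^{(n)}_{Y_\gamma^T}\right) \xrightarrow{n\to\infty} D(P_{Y_\gamma^T}\|Q_{Y_\gamma^T}) \tag{152}$$

and

$$\int_0^\gamma \left[\mathrm{mle}_{P^{(n)},Q^{(n)}}(\alpha) - \mathrm{mle}_{P^{(n)},P^{(n)}}(\alpha)\right] d\alpha \xrightarrow{n\to\infty} \int_0^\gamma \left[\mathrm{mle}_{P,Q}(\alpha) - \mathrm{mle}_{P,P}(\alpha)\right] d\alpha, \tag{153}$$

implying (25) when combined with (151).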



9 Conclusion 

Under the right loss function, we find that the Poisson channel exhibits relationships between mutual information, relative entropy, and mismatched estimation loss, for the causal and the non-causal filter, completely paralleling those recently found for the Gaussian channel. For the non-mismatched setting, our findings shed light on the classical continuous-time mutual information relation (43) (cf., e.g., [20]), as well as the recent ones of [Mj, endowing them with optimal estimation interpretations that complete the analogy to Gaussian channel results such as Duncan's formula [7] and the I-MMSE relationship [13].

To what extent our findings can be applied to scenarios involving Poisson channels, analogously to the way their Gaussian counterparts were used in, e.g., [TS] for multiuser channels, [55] for analysis of sparse-graph codes, and [30] to establish monotone convergence to a Gaussian limit under relative entropy, remains to be explored. It would also be of interest to see whether our results might lead to insight into or improvements on Poisson approximation results, such as those in [5] and references therein. Finally, it would be valuable to understand whether and how the relationships that we now know to hold for the Gaussian and the Poisson channel carry over to other channels. Steps in related directions have been taken recently, establishing that derivatives of information measures with respect to parameters governing the channel induce functionals involving the conditional distribution of the input given the output, cf. e.g. [531 EH [M]. It remains to be seen whether the latter admit interpretations of operational significance corresponding to optimum or mismatched estimation, and whether these can, in turn, be used to infer insightful relations such as between causal and non-causal estimation.

Figure 1: The loss function $\ell$. (a) $\ell(1, x)$ (b) $\ell_0(x) = \ell(x, 1) = x\log x - x + 1$

Figure 2: The curves $\mathrm{mle}_{P,P}(\gamma)$, $\mathrm{cmle}_{P,P}(\gamma)$, $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$, marked respectively by A, B, C, D, of the example in Section 6.1, plotted here for $p = 1/2$ and $q = 1/5$.

Figure 3: The curves of $\mathrm{mle}_{P,Q}(\gamma)$ and $\mathrm{cmle}_{P,Q}(\gamma)$, as expressed in (62) and (65), marked respectively by A and B. In this example, the mismatched non-causal filter performance is worse than that of the causal one.